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Data  mining  on  large  data  warehouses  is  becoming  increasingly  important.  In  support 
of  this  trend,  we  consider  a  spectrum  of  architectural  alternatives  for  integrating  mining 
with  database  systems.  These  alternatives  include  loose-coupling  through  a  SQL  cursor 
interface;  encapsulation  of  the  mining  algorithm  in  a  stored  procedure;  caching  the  data  to 
a  file  system  on-the-fly  and  mining;  tight-coupling  using  primarily  user-defined  functions; 
and  SQL  implementations  for  processing  in  the  DBMS.  First,  we  comprehensively  study 
the  option  of  expressing  the  association  rule  mining  algorithm  in  the  form  of  SQL  queries. 
We  consider  four  options  in  SQL-92  and  six  options  in  SQL  enhanced  with  object-relational 
extensions  (SQL-OR).  Our  evaluation  of  the  different  architectural  alternatives  shows  that 
from  a  performance  perspective,  the  Cache-Mine  option  is  superior,  although  the  SQL- 
OR  option  comes  a  close  second.  Both  the  Cache-Mine  and  the  SQL-OR  approaches 
incur  a  higher  storage  penalty  than  the  loose-coupling  approach  which  performance-wise 
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is  a  factor  of  3  to  4  worse  than  Cache-Mine.  We  also  compare  these  alternatives  on  the 
basis  of  qualitative  factors  like  automatic  parallelization,  development  ease,  portability  and 
interoperability. 

We  further  analyze  the  SQL-92  approaches  with  the  twin  goals  of  studying  how  best 
can  a  DBMS  without  any  object-relational  extensions  execute  these  queries  and  to  identify 
ways  of  incorporating  the  semantics  of  mining  into  cost-based  query  optimizers.  We  develop 
cost  formulae  for  the  mining  queries  based  on  the  input  data  parameters  and  relational 
operator  costs.  We  also  identify  certain  optimizations  which  improve  the  performance. 
Next,  we  study  generalized  association  rule  and  sequential  pattern  mining  and  develop 
SQL  formulations  for  them  there  by  demonstrating  that  more  complex  mining  operations 
can  be  handled  in  the  SQL  frame  work. 

We  develop  an  incremental  association  rule  mining  algorithm  which  does  not  need 
to  examine  the  old  data  if  the  frequent  itemsets  do  not  change.  Even  otherwise,  access 
to  the  old  database  can  be  limited  to  just  one  scan.  We  categorize  the  various  kinds  of 
constraints  on  the  items  that  are  useful  in  the  context  of  interactive  mining  to  facilitate  goal- 
oriented  mining.  We  show  how  the  incremental  mining  technique  can  be  adapted  to  handle 
constraints  and  certain  kinds  of  constraint  relaxation.  We  also  show  the  applicability  of 
the  incremental  algorithm  to  other  classes  of  data  mining  and  decision  support  problems. 
Finally,  we  identify  certain  primitive  operators  that  are  useful  for  a  large  class  of  data 
mining  and  decision  support  applications.  Supporting  them  natively  in  the  DBMS  could 
enable  these  applications  to  run  faster. 
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CHAPTER  1 
INTRODUCTION 


The  rapid  growth  in  data  warehousing  technology  combined  with  the  significant  drop 
in  storage  prices  has  made  it  possible  to  collect  large  volumes  of  data  about  customer 
transactions  in  retail  stores,  mail  order  companies,  banks,  stock  markets,  telecommunica- 
tion companies  and  so  on.  For  example,  AT&T  call  records  are  about  1  giga  byte  per 
hour  [73]  and  super  market  chains  like  WalMart  collect  tera  bytes  of  data.  In  order  to 
transform  this  huge  amounts  of  data  into  business  competitiveness  and  profits,  it  is  ex- 
tremely important  to  be  able  to  mine  nuggets  of  useful  and  understandable  information 
from  these  data  warehouses.  In  this  chapter,  we  introduce  data  warehousing  and  data  min- 
ing technologies  in  Sections  1.1  and  1.2  respectively  and  in  Section  1.3,  we  motivate  the 
need  for  coupling  the  two  which  is  the  focus  of  this  dissertation.  In  Section  1.4,  we  discuss 
and  list  the  specific  problems  addressed  in  this  dissertation  and  in  Section  1.5,  we  outline 
the  dissertation  organization. 

1.1     Data  Warehousing 

A  data  warehouse  is  simply  a  single,  complete  and  consistent  store  of  data  obtained 
from  a  variety  of  sources  and  made  available  to  end  users  in  a  way  they  can  understand 
and  use.  Achieving  completeness  and  consistency  of  data  in  today's  information  systems 
environment  is  far  from  simple.  The  first  problem  is  to  discover  how  completeness  and 
consistency  can  be  defined.  In  the  business  context,  this  entails  understanding  the  business 
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strategies  and  the  data  required  to  support  and  track  their  achievement.  This  process- 
called  enterprise  modeling — requires  substantial  involvement  of  business  users  and  is  tra- 
ditionally a  long-term  process.  A  data  warehouse  consists  of  several  components  and  tools 
which  are  depicted  in  Figure  1.1  [34]. 
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Figure  1.1.  Data  warehousing  architecture 


Knowing  what  data  are  required  is  just  the  first  step.  These  data  exist  today  in 
various  sources  on  different  platforms,  and  must  be  copied  from  these  sources  for  use  in  the 
warehouse.  They  must  be  combined  according  to  the  enterprise  model,  even  though  it  was 
not  originally  designed  to  support  such  integration.  They  must  be  cleansed  of  structural  and 
content  errors.  This  step — typically  known  as  data  warehouse  population — requires  tools 
for  extracting  data  from  multiple  operational  databases  and  external  sources,  for  cleaning, 
transforming  and  integrating  these  data;  and  for  loading  data  into  the  data  warehouse. 
In  addition  to  the  main  warehouse,  there  may  be  several  departmental  data  marts.  It 
also  requires  tools  for  periodically  refreshing  the  warehouse  to  maintain  consistency  and  to 
purge  data  from  the  warehouse,  perhaps  onto  slower  archival  storage. 


In  order  to  understand  and  profitably  use  the  data  in  a  business  context,  they  must 
be  transformed  into  information.  The  warehouse  data  are  stored  and  managed  by  one  or 
more  warehouse  servers,  which  present  multidimensional  views  of  the  data  to  a  variety  of 
front  end  tools:  query  tools,  report  writers,  analysis  tools  and  data  mining  tools.  There  is 
also  a  repository  which  stores  metadata  derived  using  the  enterprise  model,  and  tools  for 
monitoring  and  administering  the  warehousing  system. 

The  warehouse  may  be  distributed  for  load  balancing,  scalability  and  higher  availabil- 
ity. In  such  a  distributed  architecture,  the  metadata  repository  is  usually  replicated  with 
each  fragment  of  the  warehouse,  and  the  entire  warehouse  is  administered  centrally.  An 
alternative  architecture  is  a  federation  of  data  warehouses  or  data  marts,  each  with  its  own 
repository  and  decentralized  administration. 

1.2     Data  Mining 

Data  mining,  also  referred  to  as  knowledge  discovery  in  databases,  is  the  process  of 
extracting  implicit,  understandable,  previously  unknown  and  potentially  useful  information 
from  data.  In  other  words,  data  mining  is  the  act  of  drilling  through  huge  volumes  of  data  in 
order  to  discover  relationships,  or  to  answer  specific  questions  that  are  too  broad  in  nature 
for  traditional  query  tools.  Data  mining  is  also  projected  as  the  next  step  beyond  online 
analytical  processing  (OLAP)  for  querying  data  warehouses.  Rather  than  seek  out  known 
relationships,  it  sifts  through  data  for  unknown  relationships.  For  instance,  consider  the 
transaction  data  in  a  mail-order  company  stored  in  the  following  relations:  sales  (customer, 
widget,  state,  year),  booster(customer,  widget,  driver),  catalog  (widget,  manufacturer).  The 
sale  information  is  stored  in  the  sales  table  and  the  catalog  table  stores  the  widgets  of 
different  manufacturers.  The  booster  table  stores  the  driver  that  influences  the  particular 
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sale.  In  this  database,  OLAP  finds  answers  to  queries  of  the  form  "How  many  widgets  were 
sold  in  the  first  quarter  of  1998  in  California  vs.  Florida?"  However,  data  mining  attempts 
to  answer  queries  like  "What  are  the  drivers  that  caused  people  to  buy  these  widgets  from 
our  catalog?"  Fundamentally,  data  mining  is  statistical  analysis  and  has  been  in  practice  for 
a  long  time.  But,  until  recently,  statistical  analysis  was  a  time-consuming,  manual  process 
which  limited  the  amount  of  data  that  could  be  analyzed  and  the  accuracy  depended  heavily 
on  the  personnel  involved  in  the  analysis.  Today,  with  the  advent  of  various  sophisticated 
technologies,  tools  exist  that  automate  the  process,  making  data  mining  a  practical  solution 
for  a  wide  range  of  companies.  For  example,  Fingerhut's  (a  direct-mail  catalog  company) 
statistical  analysis  was  limited  to  taking  samples  of  10  to  20  percent  of  its  customers.  With 
data  mining,  it  can  examine  300  specific  characteristics  of  each  of  the  10  to  12  million 
customers  in  a  much  more  focused  way  [15]. 

The  initial  efforts  on  data  mining  research  were  to  cull  together  techniques  from  ma- 
chine learning  and  statistics  [24,  28,  29,  37,  39,  40,  60,  65,  81,  90,  95,  102,  103,  104]  to  define 
new  mining  operations  and  develop  algorithms  for  them  [5,  7,  19,  77].  In  the  remainder  of 
this  section,  we  briefly  introduce  the  various  data  mining  problems. 

1.2.1     Association  Rule 

Association  rule  which  captures  co-occurrence  of  items  or  events  was  introduced  in 
the  context  of  market  basket  data  [7].  An  example  of  such  a  rule  might  be  that  60%  of 
transactions  containing  beer  also  contain  diapers  and  2%  of  transactions  contain  both  these 
items.  Here  60%  is  the  support  and  2%  is  the  confidence  of  the  rule  beer  =4>  diaper. 

Association  rule  mining  is  stated  formally  as  follows  [13].  Let  1  —  {i\,i2,  ■  ■  ■  ,im}  be 
a  set  of  literals,  called  items.   Let  V  be  a  set  of  transactions,  where  each  transaction  T  is 


a  set  of  items  such  that  TCI.  Each  transaction  has  a  unique  identifier,  called  its  tid. 
An  association  rule  is  an  implication  of  the  form  X  ==$■  Y,  where  X  C  I,  Y  C  I,  and 
InV^I.  The  rule  X  =>  Y  holds  in  the  transaction  set  V  with  confidence  c  if  c%  of 
transactions  in  V  that  contain  X  also  contain  Y .  The  rule  X  =>  y  has  support  s  in  the 
transaction  set  V  if  s%  of  transactions  in  V  contain  X  UY.  Given  a  set  of  transactions  V, 
the  problem  of  mining  association  rules  is  to  generate  all  association  rules  that  have  support 
and  confidence  greater  than  the  user-specified  minimum  support  and  minimum  confidence. 
The  association  rule  mining  problem  has  attracted  tremendous  attention  from  data 
mining  researchers  and  several  algorithms  have  been  proposed  for  it  [13,  26,  66,  111,  131]. 
Researchers  have  also  proposed  several  variants  of  the  basic  association  rule  mining  to 
handle  taxonomies  over  items  [63,  118],  numeric  attributes  [76,  92,  119]  and  constraints  on 
items  appearing  in  rules  [97,  121]. 

1.2.2     Sequential  Patterns 

Sequential  pattern  mining  was  first  introduced  in  Agrawal  and  Srikant  [14]  and  further 
generalized  in  Srikant  and  Agrawal  [120].  Given  a  set  of  data-sequences  each  of  which  is  a  list 
of  transactions  ordered  by  the  transaction  time,  the  problem  of  mining  sequential  patterns  is 
to  discover  all  sequences  with  a  user-specified  minimum  support.  Each  transaction  contains 
a  set  of  items.  A  sequential  pattern  is  an  ordered  list  (sequence)  of  itemsets.  The  itemsets 
that  are  contained  in  the  sequence  are  called  elements  of  the  sequence.  For  example, 
({computer,  modem)(pr inter))  is  a  sequence  with  two  elements;  (compirier,  modem)  and 
(printer).  The  support  of  a  sequential  pattern  is  the  percentage  of  data-sequences  that 
contain  the  sequence.  A  sequential  pattern  can  be  further  qualified  by  specifying  maximum 
and/or  minimum  time  gaps  between  adjacent  elements  and  a  sliding  time  window  within 
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which  items  are  considered  part  of  the  same  sequence  element.  These  time  constraints  are 
specified  by  three  parameters,  max-gap,  min-gap  and  window-size.  We  refer  the  reader  to 
Srikant  and  Agrawal  [120]  for  a  formal  definition  of  the  problem. 

1.2.3     Classification 

Classification  is  a  well-studied  problem  [91,  129].  However,  only  recently  has  there  been 
focus  on  algorithms  that  can  handle  large  datasets  [17,  53,  85,  106,  114].  Classification  is 
the  process  of  generating  a  description  or  a  model  for  each  class  of  a  given  dataset,  called 
the  training  set.  Each  record  of  the  training  set  consists  of  several  attributes  which  could 
be  continuous  (coming  from  an  ordered  domain),  or  categorical  (coining  from  an  unordered 
domain).  One  of  the  attributes,  called  the  classifying  attribute,  indicates  the  class  to  which 
each  record  belongs.  Once  a  model  is  built  from  the  given  examples,  it  can  be  used  to 
determine  the  class  of  future  unclassified  records. 

Several  classification  models  based  on  neural  networks,  statistical  models  like  lin- 
ear/quadratic discriminants,  decision  trees  and  genetic  models  have  been  proposed  over  the 
years.  Decision  trees  are  particularly  suited  for  data  mining  since  they  can  be  constructed 
relatively  fast  compared  to  other  methods  and  they  are  simple  and  easy  to  understand. 
Moreover,  trees  can  be  easily  converted  into  SQL  statements  that  can  be  used  to  access 
databases  efficiently  [5]. 

A  decision  tree  is  a  class  discriminator  that  recursively  partitions  the  training  set 
until  each  partition  consists  entirely  or  dominantly  of  examples  from  one  class.  Each  non- 
leaf  node  of  the  tree  contains  a  split  point  which  is  a  test  on  one  or  more  attributes  and 
determines  how  the  data  is  partitioned.  Figure  1.2  shows  a  sample  decision-tree  classifier 
and  the  training  set  from  which  it  is  derived.  Each  record  represents  a  credit  card  applicant 


and  we  are  interested  in  building  a  model  that  categorizes  the  applicants  into  high  and  low 


risk  classes. 
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Figure  1.2.  Credit  card  classification  example 


1.2.4     Clustering 

The  fundamental  clustering  problem  is  that  of  grouping  together  (clustering)  similar 
data  items  and  is  useful  for  discovering  interesting  distributions  and  patterns  in  the  under- 
lying data.  Clustering  has  been  formulated  in  various  ways  in  the  machine  learning  [48], 
pattern  recognition  [51],  optimization  [112]  and  statistics  literature  [20].  The  problem  of 
clustering  can  be  defined  as  follows:  given  n  data  points  in  a  d-dimensional  metric  space, 
partition  the  data  points  into  k  clusters  such  that  the  data  points  within  a  cluster  are  more 
similar  to  each  other  than  data  points  in  different  clusters.  The  classic  K-Means  clustering 
algorithm  starts  with  estimated  initial  values  for  the  A;  cluster  centroids.  Each  of  the  data 
points  is  assigned  to  the  cluster  with  the  nearest  centroid.  After  the  assignment  the  cen- 
troids are  refined  to  the  mean  of  the  data  points  in  that  cluster.  This  process  is  repeated 
several  times  until  an  acceptable  convergence  is  reached.  There  are  several  research  efforts 
reported  in  the  data  mining  literature  for  clustering  large  databases  [23,  42,  57,  132]. 


Research  on  time  series  analysis  [10,  43,  75],  similarity  search  [3,  8,  21,  72,  115], 
high  dimensional  data  analysis  [4,  22,  79],  and  text  data  management  [30,  31]  can  also  be 
classified  under  the  broad  category  of  data  mining. 

1.3     Mining  Databases 

The  initial  research  efforts  on  data  mining  were  aimed  at  defining  new  mining  problems 
and  a  majority  of  the  algorithms  for  them  were  developed  for  data  stored  in  file  systems. 
Each  had  its  own  specialized  data  structures,  buffer  management  strategies  and  so  on  [8,  13, 
14,  22,  49,  50,  62,  63,  74,  84,  85,  86,  111,  115,  119,  121].  In  cases  where  the  data  are  stored  in 
a  DBMS,  data  access  was  provided  through  an  ODBC  or  SQL  cursor  interface  [2,  64,  68,  71]. 
Data  mining  tools  are  being  used  in  several  application  domains  [16,  18,  27,  38,  41,  46,  52, 
70,  82,  96,  101,  108,  130].  Coupling  the  data  mining  tools  with  a  growing  base  of  accessible 
enterprise  data — often  in  the  form  of  a  data  warehouse — provides  business  institutions  at 
its  disposal  a  tool  with  immense  implications.  According  to  the  vice  president  of  Mellon 
Bank's  advanced  technology  group,  "Data  mining  is  the  carrot  that  justifies  the  expensive 
stick  of  building  a  data  warehouse." 

The  majority  of  the  warehouse  stores — systems  used  for  storing  warehouse  data — are 
relational  databases  or  their  variants.  The  advantages  of  using  database  systems  are  numer- 
ous: SQL  was  invented  for  direct  query  of  data,  most  client/server  connectivity  is  supplied 
by  relational  vendors,  most  replication  systems  have  been  designed  with  relational  sources 
and  targets  and  most  of  the  relational  vendors  are  delivering  parallel  database  solutions. 
There  are  important  alternatives  in  this  segment,  however.  The  OLAP  multidimensional  en- 
gines offer  unique  performance  characteristics  across  their  problem  domain.  We  might  also 
see  traditional  file  stores  providing  significantly  better  performance  for  some  data  mining 


operations.  Differentiators  for  the  relational  engines  will  be  overall  scalability,  availability 
across  a  broad  spectrum  of  hardware  platforms,  "affinity"  with  legacy  data  platforms  and 
stores,  availability  of  extensions  for  image,  text  and  so  on  to  support  the  next  generation 
of  applications,  and  availability  of  supporting  tools. 

The  investment  in  building  and  managing  a  data  warehouse  is  enormous.  With  the 
advent  of  business  intelligence  and  decision  support  systems,  it  is  imperative  that  the 
data  warehouse  support  multiple  applications.  Figure  1.3  illustrates  how  a  data  warehouse 
will  typically  be  used  in  a  business  organization.  The  spectrum  of  applications  include 
basic  querying  and  reporting  tools,  OLAP  tools,  multidimensional  analysis  tools  and  data 
mining  tools.  There  are  several  excellent  and  popular  query/report  writers.  There  are  also 
tools  that  support  multidimensional  analysis  directly  on  relational  data  stores  without  the 
separate  OLAP  engine.  Note  that  these  tools  are  like  tools  in  a  tool  box;  that  is,  they  can 
be  used  in  combination  to  produce  the  desired  result.  There  is  no  single  class  of  tools  that 
satisfies  the  broad  range  of  decision  support  and  business  intelligence  system  requirements. 
Therefore,  it  is  crucial  that  the  data  mining  tools  integrate  with  relational  data  warehouses 
much  like  the  query/report  and  OLAP  tools. 

1.4     Goal 

The  goal  of  this  dissertation  is  to  explore  the  various  issues  of  database/data  ware- 
house integration  of  mining  operations.  We  first  study  the  various  architectural  alternatives 
for  coupling  data  mining  with  relational  database  systems,  primarily  from  a  performance 
perspective.  We  develop  various  SQL  formulations  for  association  rules  [7],  a  representative 
mining  problem,  and  analyze  how  competitive  can  the  SQL  implementation  be  compared 
to  other  specialized  implementations  of  association  rule  mining.   We  further  focus  on  the 
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Figure  1.3.  Typical  data  warehouse  usage 

analysis  of  various  execution  plans  chosen  by  relational  database  systems  for  executing  some 
of  the  SQL-based  mining  queries.  We  expect  that  this  study  will  reveal  the  domain-specific 
semantic  information  of  the  mining  algorithms  that  need  to  be  integrated  into  next  gen- 
eration query  optimizers  to  handle  mining  computations  efficiently.  We  also  develop  SQL 
formulations  for  a  few  other  mining  operations,  namely,  generalized  association  rules  [118] 
and  sequential  patterns  [14,  120].  We  also  propose  a  few  primitive  database  operators  that 
are  useful  for  mining  and  other  decision  support  applications.  These  operators,  if  natively 
supported  in  a  database  system,  could  potentially  speed  up  the  execution  of  mining  queries. 
Data  warehouses  on  which  the  mining  tools  operate  typically  are  populated  incremen- 
tally. In  order  to  improve  the  reliability  and  usefulness  of  the  discovered  information,  large 
volumes  of  data  need  to  be  collected  and  analyzed  over  a  period  of  time.  A  naive  approach 
to  update  the  mined  information,  when  new  data  are  added  or  part  of  the  current  data 
are  deleted,  is  to  recompute  them  from  scratch.  However,  it  would  be  ideal  to  develop  an 
incremental  algorithm  so  that  the  computation  effort  spent  on  the  original  data  can  be 
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effectively  utilized.  We  develop  an  incremental  algorithm  and  its  SQL  formulations  for  up- 
dating the  association  rules,  based  on  the  negative  border  concept.  It  is  based  on  the  closure 
property  of  frequent  itemsets,  that  is,  all  subsets  of  a  frequent  itemset  are  also  frequent. 
We  show  that  the  incremental  mining  algorithm  can  be  generalized  to  handle  various  kinds 
of  constraints.  We  categorize  the  constraints  based  on  their  usage  in  the  mining  process 
and  develop  a  general  framework  to  process  them.  We  further  show  that  the  incremental 
technique  is  applicable  to  other  classes  of  mining  and  decision  support  problems  such  as 
sequential  patterns,  correlation  rules,  query  flocks  and  so  on. 
We  summarize  and  list  the  goals  of  this  dissertation  below: 

•  Survey  the  various  data  mining  problems  and  algorithms, 

•  Study  the  different  database  integration  alternatives  for  data  mining, 

•  Develop  and  implement  SQL  formulations  of  mining  algorithms, 

•  Analyze  the  performance  profile  of  current  DBMSs  to  execute  the  above  SQL  queries, 

•  Explore  the  enhancements  to  current  cost-based  optimizers  to  incorporate  the  domain- 
specific  semantics  of  mining, 

•  Develop  an  incremental  association  rule  mining  algorithm  and  its  SQL  formulations, 

•  Generalize  the  incremental  algorithm  for  mining  constrained  associations  and  show 
its  applicability  to  other  data  mining  and  decision  support  problems,  and 

•  Explore  primitive  operators  for  mining  in  databases. 

This  work  has  a  significant  impact  on  the  state-of-the-art  in  data  mining  system  archi- 
tectures and  comes  at  the  appropriate  time  when  the  data  mining  community  is  looking  for 
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answers  to  "How  to  mine  data  warehouses?"  Given  the  amount  of  data  involved  in  mining, 
its  potential  impact  on  various  business  sectors  and  the  fact  that  OLAP  is  finding  its  way 
into  commercial  database  systems  (for  example,  the  cube  operator),  it  is  only  a  matter  of 
time  before  mining  becomes  an  integral  part  of  database  systems.  We  believe  that  this  work 
is  a  small  but  strong  step  in  the  right  direction.  This  will  also  have  a  significant  impact  on 
query  optimization  and  parallel  query  processing  techniques. 

1.5     Thesis  Organization 

The  rest  of  this  dissertation  is  organized  as  follows:  We  discuss  the  various  architectural 
alternatives  for  integrating  mining  with  database  systems/data  warehouses  in  Chapter  2. 
The  various  SQL  formulations  of  association  rules,  their  performance  profiles  and  optimiza- 
tions for  improving  the  performance  are  detailed  in  Chapter  3.  In  Chapter  4,  we  present 
the  use  of  object-relational  extensions  to  SQL  for  improving  the  performance  of  SQL-based 
association  rule  mining  and  the  performance  comparison  of  the  various  architectural  al- 
ternatives. SQL-based  mining  of  generalized  association  rules  and  sequential  patterns  are 
described  in  Chapters  5  and  6  respectively.  Chapter  7  presents  the  incremental  associa- 
tion rule  mining  algorithm,  performance  comparison,  SQL  formulation  and  generalizations 
for  mining  constrained  associations.  We  conclude  in  Chapter  8  with  a  discussion  of  the 
proposed  database  operators  and  avenues  for  further  research. 


CHAPTER  2 
FROM  FILE  MINING  TO  DATABASE  MINING 


The  "first-generation"  KDD  systems  offer  isolated  discovery  features  using  tree  induc- 
ers, neural  nets  and  rule  discovery  algorithms.  Such  systems  cannot  be  embedded  into  a 
large  application  and  typically  offer  just  one  knowledge  discovery  feature.  The  current  state 
of  data  mining  systems  is  very  much  similar  to  the  days  in  which  database  applications  were 
written  in  COBOL  with  just  read  and  write  commands  as  the  interface  to  data  stored  in 
large  files.  The  advent  of  relational  database  systems  which  offered  SQL  for  ad  hoc  queries 
and  various  relational  APIs  (application  programming  interfaces)  for  application  program- 
ming made  database  applications  much  easier  to  develop  and  manage.  Data  mining  has  to 
undergo  a  similar  transition  from  the  current  "file  mining"  to  data  warehouse  mining  and 
a  richer  set  of  APIs  for  developing  business  intelligence  and  decision  support  applications. 

In  the  remainder  of  this  chapter,  we  survey  some  of  the  prior  research  related  to  the 
database  integration  of  mining  in  Section  2.1.  The  various  architectural  alternatives  are 
discussed  in  Section  2.2. 

2.1     Related  Work 

Researchers  have  started  to  focus  on  various  issues  related  to  integrating  mining  with 
databases  [6,  67,  68].  The  research  on  database  integration  of  mining  can  be  broadly  clas- 
sified into  two  categories;  one  which  proposes  new  mining  operators  and  the  other  which 
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leverages  the  query  processing  capabilities  of  current  relational  DBMSs.  In  the  former  cate- 
gory, there  have  been  language  proposals  to  extend  SQL  with  specialized  mining  operators. 
A  few  examples  are  (i)  the  query  language  DMQL  proposed  by  Han  et  al.  [64]  which  ex- 
tends SQL  with  a  collection  of  operators  for  mining  characteristic  rules,  discriminant  rules, 
classification  rules,  association  rules  and  so  on,  (ii)  The  M-SQL  language  of  Imielinski  and 
Virmani  [69]  which  extends  SQL  with  a  special  unified  operator  Mine  to  generate  and 
query  a  whole  set  of  propositional  rules,  and  (iii)  the  mine  rule  operator  proposed  by  Meo 
et  al.  [89]  for  a  generalized  version  of  the  association  rule  discovery  problem.  However,  they 
do  not  address  the  processing  techniques  for  these  operators  inside  a  database  engine  and 
the  interaction  of  the  standard  relational  operators  and  the  proposed  extensions.  It  is  also 
important  to  break  these  operators  to  a  finer  level  of  granularity  in  order  to  identify  com- 
monalities between  them  and  derive  a  set  of  primitive  operators  that  should  be  supported 
natively  in  a  database  engine. 

In  the  second  category,  researchers  have  addressed  the  issue  of  exploiting  the  capa- 
bilities of  conventional  relational  systems  and  their  object-relational  extensions  to  execute 
mining  operations.  This  entails  transforming  the  mining  operations  into  database  queries 
and  in  some  cases  developing  newer  techniques  that  are  more  appropriate  in  the  database 
context.  The  proposal  of  Agrawal  and  Shim  [12]  for  tightly  coupling  a  mining  algorithm 
with  a  relational  database  system  makes  use  of  user-defined  functions  (UDFs)  in  SQL 
statements  to  selectively  push  parts  of  the  application  that  perform  computations  on  data 
records  into  the  database  system.  The  objective  was  to  avoid  one-at-a-time  record  retrieval 
from  the  database  to  the  application  address  space,  saving  both  the  copying  and  process 
context  switching  costs.  In  the  KESO  project  [116],  the  focus  is  on  developing  a  data 
mining  system  which  interacts  with  standard  DBMSs.   The  interaction  with  the  database 
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is  restricted  to  two-way  table  queries,  a  special  kind  of  aggregate  query.  Two-way  tables, 
which  are  used  in  the  mining  process,  have  sets  of  source  and  target  attributes  and  an 
associated  count.  Association  rule  mining  was  formulated  as  SQL  queries  in  the  SETM 
algorithm  [66].  However,  it  does  not  use  the  subset  property — all  subsets  of  a  frequent 
itemset  are  frequent — for  candidate  generation.  As  a  result,  SETM  counts  a  large  number 
of  candidate  itemsets  in  the  support  counting  phase  and  hence  is  not  efficient  [9].  Query 
flocks  generalizes  boolean  association  rules  to  mine  associations  across  relational  tables.  A 
query  flock  is  a  parameterized  query  with  a  filter  condition  to  eliminate  the  values  of  param- 
eters that  are  "uninteresting".  Tsur  et  al.  [128]  present  the  use  of  query  flocks  for  mining 
and  emphasizes  the  need  for  incorporating  the  a-priori  technique  into  new  generation  query 
optimizers  to  handle  mining  queries  efficiently. 

2.2     Architectural  Alternatives 

The  various  architectural  alternatives  for  integrating  mining  with  relational  systems, 
proposed  in  [109],  can  be  categorized  as  shown  in  Figure  2.1.  It  shows  a  taxonomy  of 
various  alternatives  from  loose-coupling  to  an  integrated  approach.  In  the  remainder  of 
this  section,  we  describe  each  of  the  alternatives. 

2.2.1     Loose- Coupling 

This  is  an  example  of  integrating  mining  applications  into  the  client  in  a  client/server 
architecture  or  into  the  application  server  in  a  multi-tier  architecture.  The  mining  kernel 
can  be  considered  as  the  application  server.  In  this  approach,  data  are  read  tuple  by  tuple 
from  the  DBMS  to  the  mining  kernel  using  a  cursor  interface.  The  intermediate  and  the 
final  results  are  stored  back  into  the  DBMS.  The  data  are  never  copied  on  to  a  file  system. 
Instead,  the  DBMS  is  used  as  a  file  system.    This  is  the  approach  followed  by  most  existing 
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Figure  2.1.  Taxonomy  of  architectural  alternatives 


mining  systems.  A  potential  problem  with  this  approach  is  the  high  context  switch  costs 
between  the  DBMS  and  the  mining  kernel  processes  since  they  run  in  different  address 
spaces.  In  spite  of  the  block-read  optimization  present  in  many  systems  (for  example, 
Oracle  [98],  DB2  [32])  where  a  block  of  tuples  is  read  at  a  time  instead  of  reading  a  single 
tuple  at  a  time,  the  performance  could  suffer.  This  architecture  is  outlined  in  Figure  2.2 
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Figure  2.2.  Loose-coupling  architecture 


2.2.2     Cache-Mine 

This  is  a  special  case  of  the  loose-coupling  approach  where  the  mining  algorithm  reads 
data  from  the  DBMS  only  once  and  caches  the  relevant  data  in  flat  files  on  local  disk  for 
future  references.  The  data  could  be  transformed  and  cached  in  a  format  that  enables  more 
efficient  access  in  the  future.  The  mined  results,  first  generated  as  flat  files,  are  imported 
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into  the  DBMS.  This  method  has  all  the  advantages  of  the  stored  procedure  approach 
(described  below)  plus  it  promises  to  have  better  performance.  The  disadvantage  is  that  it 
requires  additional  disk  space  for  caching.  This  architecture  is  outlined  in  Figure  2.3. 
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Figure  2.3.  Cache-mine  architecture 


2.2.3     Stored-Procedure 


This  architecture  is  representative  of  the  case  where  the  mining  logic  is  embedded  as 
applications  on  the  database  server.  In  this  approach,  shown  in  Figure  2.4,  the  mining 
algorithm  is  encapsulated  as  a  stored  procedure  [32]  that  is  executed  in  the  same  address 
space  as  the  DBMS.  The  main  advantage  of  this  as  well  as  the  loose-coupling  and  cache- 
mine  approach  is  greater  programming  flexibility.  Also,  any  existing  file  system  code  can 
be  easily  transformed  to  work  on  data  stored  in  the  DBMS. 
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Figure  2.4.  Stored-procedure  architecture 


2.2.4     User-Defined  Function 

This  approach  is  another  variant  of  embedding  mining  as  an  application  on  the  database 
server  if  the  user-defined  functions  are  run  in  the  unfenced  mode  (same  address  space  as 
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the  server)  [32].  In  this  case,  the  entire  mining  algorithm  is  encapsulated  as  a  collection  of 
user-defined  functions  (UDFs)  [32]  that  are  appropriately  placed  in  SQL  data  scan  queries. 
The  architecture  is  represented  in  Figure  2.5.  Most  of  the  processing  happens  in  the  UDF 
and  the  DBMS  is  used  simply  to  provide  tuples  to  these  UDFs.  Little  use  is  made  of  the 
query  processing  capabilities  of  the  DBMS.  The  UDFs  can  be  run  in  either  fenced  (different 
address  space)  or  unfenced  (same  address  space)  mode.  The  main  attraction  of  this  method 
is  performance  since  when  run  in  the  unfenced  mode  individual  tuples  never  have  to  cross 
the  DBMS  boundary.  Otherwise,  the  processing  happens  in  almost  the  same  manner  as 
in  the  stored  procedure  case.  The  main  disadvantage  is  the  development  cost  [12]  since 
the  entire  mining  algorithm  has  to  be  written  as  UDFs  involving  significant  code  rewrites. 
Further,  these  are  "heavy-weight"  UDFs  which  involve  significant  processing  and  memory 
management. 
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Figure  2.5.  UDF-based  mining  architecture 


In  order  to  provide  a  query  interface  or  application  programming  interface  to  the 
discovered  rules,  they  can  be  passed  through  a  post-processing  step.  The  rule  discovery 
itself  could  be  done  by  any  of  the  above  alternatives. 

2.2.5     SQL-Based  Approach 

This  is  the  integration  architecture  explored  in  this  dissertation.  In  this  approach,  the 
mining  algorithm  is  formulated  as  SQL  queries  which  are  executed  by  the  DBMS  query 
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processor.  We  develop  several  SQL  formulations  for  a  few  representative  mining  operations 
in  order  to  better  understand  the  performance  profile  of  current  database  query  processors 
in  executing  these  queries.  We  believe  that  it  will  enable  us  to  identify  what  portions  of 
these  mining  operations  can  be  pushed  down  to  the  query  processing  engine  of  a  DBMS. 

There  are  also  several  potential  advantages  of  a  SQL  implementation.  One  can  prof- 
itably make  use  of  the  database  indexing  and  query  processing  capabilities  thereby  leverag- 
ing on  more  than  two  decades  of  development  effort  spent  in  making  these  systems  robust, 
portable,  scalable,  and  highly  concurrent.  Rather  than  devising  specialized  paralleliza- 
tions,  one  can  potentially  exploit  the  underlying  SQL  parallelization,  particularly  in  an 
SMP  environment.  The  current  approach  to  parallelizing  mining  algorithms  is  to  develop 
specialized  parallelizations  for  each  of  the  algorithms  [11,  61,  113,  114].  The  DBMS  sup- 
port for  check-pointing  and  space  management  can  be  especially  valuable  for  long-running 
mining  algorithms  on  huge  volumes  of  data.  The  development  of  new  algorithms  could  be 
significantly  faster  if  expressed  declaratively  using  a  few  SQL  operations.  This  approach  is 
also  extremely  portable  across  DBMS's  since  porting  becomes  trivial  if  the  SQL  approaches 
use  only  the  standard  SQL  features. 
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Figure  2.6.  SQL  architecture  for  mining  in  a  DBMS 


The  architecture  we  have  in  mind  is  schematically  shown  in  Figure  2.6.  We  visualize 
that  the  desired  mining  operation  will  be  expressed  in  some  extension  of  SQL  or  a  graphical 
language.    A  preprocessor  will  generate  appropriate  SQL  translation  for  this  operation. 
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This  preprocessor  will  be  able  to  select  the  right  translation  taking  into  account  input 
data  distributions.  We  consider  SQL  translations  that  can  be  executed  on  a  SQL-92  [88] 
relational  engine,  as  well  as  translations  that  require  some  of  the  newer  object-relational 
capabilities  being  designed  for  SQL  [78].  Specifically,  we  assume  availability  of  blobs,  user- 
defined  functions,  and  table  functions  [80]  in  the  Object-Relational  engine.  We  do  not 
require  any  mining  specific  extension  in  the  underlying  execution  engine;  identification  of 
such  extensions  is  one  of  the  goals  of  this  study. 

We  do  quantitative  and  qualitative  comparisons  of  some  of  the  architectural  alter- 
natives listed  here.  Our  primary  focus  is  on  the  performance  of  various  architectural  al- 
ternatives and  the  identification  of  possible  enhancements  to  the  query  optimizer  and  the 
query  processing  engine.  The  issues  of  the  language  constructs  required  to  extend  SQL 
with  mining  features,  and  the  details  of  the  preprocessing  step  shown  in  Figure  2.6  are 
secondary. 

It  might  also  be  possible  to  integrate  mining  with  databases  using  the  newer  extension 
technologies  like  database  extenders,  data  cartridges  or  data  blades. 

2.2.6     Integrated  Approach 

This  is  the  tightest  form  of  integration  where  the  mining  operations  are  an  integral 
part  of  the  database  query  engine.  In  this  approach  there  is  no  clear  boundary  between 
simple  querying,  OLAP  and  mining;  that  is  querying  and  mining  are  treated  to  be  similar 
operations.  The  user's  goal  is  to  get  information  from  the  data  store.  He/she  should  not 
have  to  make  the  distinction  as  to  whether  it  is  the  result  of  querying/OLAP/mining. 
This  entails  unbundling  the  bulky  mining  operations  and  identifying  common  operator 
primitives  with  which  the  mining  operations  can  be  composed.  We  cannot  expect  to  have 
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a  specialized  operator  for  every  mining  task.  It  also  needs  a  language  in  which  the  required 
operations  can  be  specified.  In  order  to  realize  this  goal,  it  requires  tremendous  amount 
of  research  in  various  aspects  like  designing  language  extensions,  better  query  processing 
and  optimization  strategies.  However,  we  envision  that  the  query  processing  engine  will 
eventually  be  extended  with  primitive  mining  operators.  When  that  is  accomplished,  a 
mining  system  architecture  will  resemble  the  one  shown  in  Figure  2.7. 
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Figure  2.7.  Architecture  for  mining  in  next-generation  DBMSs 


2.3     Summary 

The  first  two  approaches  in  the  architecture  taxonomy  in  Figure  2.1,  namely,  mining  in 
the  application  server  or  database  server,  facilitates  the  move  from  file  mining  to  database 
mining  rather  easily.  However,  as  explained  in  the  loose-coupling,  cache-mine,  stored- 
procedure  and  user-defined  function  approaches,  they  do  not  utilize  the  query  processing 
functionality  provided  by  the  DBMS. 

In  this  dissertation  we  pursue  the  third  approach  in  Figure  2.1,  which  uses  SQL  and 
its  extensions  to  implement  the  mining  algorithms.  This  acts  as  a  pre-cursor  to  determine 
the  extensions  to  current  query  processors  and  optimizers  in  order  to  move  towards  the  last 
approach  which  is  the  truly  integrated  approach. 

Note  The  architectural  alternatives  in  Section  2.2  were  developed  primarily  by 
researchers  from  IBM  Almaden  Research  Center  and  the  author  was  a  contributor. 


CHAPTER  3 
ASSOCIATION  RULES 


In  this  chapter,  we  discuss  the  various  SQL-92  (SQL  with  no  object-relational  ex- 
tensions) formulations  of  association  rule  mining.  We  start  with  a  review  of  the  apriori 
algorithm  for  association  rule  mining  in  Section  3.1.  A  few  other  algorithms  for  mining- 
association  rules  are  briefly  outlined  in  Section  3.2.  The  input-output  data  formats  are  de- 
scribed in  Section  3.3  and  in  Section  3.4,  we  introduce  SQL-based  association  rule  mining. 
The  various  SQL-92  formulations  are  presented  in  Section  3.5.  We  present  experimental 
results  showing  the  performance  of  these  formulations  on  some  real-life  datasets  in  Sec- 
tion 3.6.  In  Section  3.7,  we  develop  cost  formulae  for  the  cost  of  executing  the  above  SQL 
queries  on  a  query  processor,  based  on  the  input  data  parameters  and  relational  operator 
costs.  A  few  performance  optimizations  to  the  basic  SQL-92  approaches  and  the  corre- 
sponding performance  gains  are  presented  in  Section  3.8.  Section  3.9  quantifies  the  overall 
performance  improvements  of  the  optimizations  with  experiments  on  synthetic  datasets. 

The  association  rule  mining  problem  outlined  in  Section  1.2.1  can  be  decomposed  into 
two  subproblems  [7]. 

•  Find  all  combinations  of  items  whose  support  is  greater  than  minimum  support.  Call 
those  combinations  frequent  itemsets. 

•  Use  the  frequent  itemsets  to  generate  the  desired  rules.  The  idea  is  that  if,  say,  ABCD 
and  AB  are  frequent  itemsets,  then  we  can  determine  if  the  rule  AB^-CD  holds  by 
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computing  the  ratio  r  —  support(^5CZ))/support(Ai?).  The  rule  holds  only  if  r  > 
minimum  confidence.  Note  that  the  rule  will  have  minimum  support  because  ABCD 
is  frequent. 

3.1     Apriori  Algorithm 

We  use  the  Apriori  algorithm  [9]  as  the  basis  for  our  presentation.  There  are  recent 
proposals  aimed  at  improving  the  performance  of  the  Apriori  algorithm  by  reducing  the 
number  of  data  passes  [26,  127].  They  all  have  the  same  basic  data-flow  structure  as  the 
Apriori  algorithm.  Our  goal  is  to  understand  how  best  to  integrate  this  basic  structure 
within  a  database  system. 

F\  =  {frequent  1-itemsets} 
for(A;  =  2;JPfc_1/0;A;  +  +)  do 

Ck  =  apriori-gen(Fyt_1);  //  generate  new  candidates 
forall  transactions  t  G  V  do 

Ct  =  subset(C\,£);  //  find  all  candidates  contained  in  t 
forall  candidates  c  G  Ct  do 

c.  count ++; 
done 
done 

-ffe  =  {c  G  Cfc|c.count  >  minsup} 
done 
Answer  =  \JkFk; 

Figure  3.1.  Apriori  algorithm 
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The  basic  Apriori  algorithm  shown  in  Figure  3.1  makes  multiple  passes  over  the  data. 
In  the  k  pass  it  finds  all  itemsets  with  k  items  having  the  minimum  support,  called  the 
frequent  fc-itemsets.  Each  pass  consists  of  two  phases.  Let  F^  represent  the  set  of  frequent 
fc-itemsets,  and  Ck  the  set  of  candidate  A>itemsets  (potentially  frequent  itemsets).  First  is 
the  candidate  generation  phase  where  the  set  of  all  frequent  (k— l)-itemsets,  Fjt_i,  found 
in  pass  (A;  —  1),  is  used  to  generate  the  candidate  itemsets  Ck-  The  candidate  generation 
procedure  ensures  that  Ck  is  a  superset  of  the  set  of  all  frequent  fc-itemsets.  The  algorithm 
builds  a  specialized  hash-tree  data  structure  in  memory  out  of  Ck-  Then  is  the  support 
counting  phase  where  the  transaction  database  is  scanned.  For  each  transaction,  the 
algorithm  determines  which  of  the  candidates  in  Ck  are  contained  in  the  transaction  using 
the  hash-tree  data  structure  and  increments  their  support  count.  At  the  end  of  the  pass,  Ck 
is  examined  to  determine  which  of  the  candidates  are  frequent,  yielding  Fk-  The  algorithm 
terminates  when  Ck+i  becomes  empty. 

3.2     Other  Algorithms 

There  are  several  other  file  based  algorithms  for  mining  association  rules.  However 
many  of  them  follow  the  basic  apriori  framework  and  tries  to  improve  on  the  computation 
and  I/O  requirements. 

The  partition  algorithm  [111]  reduces  the  I/O  requirements  to  just  two  passes  over  the 
entire  dataset.  The  reason  the  database  needs  to  be  scanned  multiple  times  is  because  the 
number  of  possible  itemsets  to  be  tested  for  support  is  exponentially  large  if  it  must  be  done 
in  a  single  scan  of  the  database.  The  partition  algorithm  generates  a  set  of  all  potentially 
frequent  itemsets  in  one  scan  of  the  database.  This  set  is  a  superset  of  all  frequent  itemsets. 
In  the  second  scan,  the  actual  support  of  these  itemsets  in  the  whole  database  is  computed. 
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The  algorithm  executes  in  two  phases.  In  the  first  phase,  it  divides  the  database  into  a 
number  of  non-overlapping  partitions.  The  frequent  itemsets  in  each  of  these  partitions 
are  computed  separately  which  will  involve  multiple  passes  over  the  data.  However,  the 
partition  sizes  are  chosen  such  that  an  entire  partition  fits  in  main  memory  so  that  it  is 
read  only  once  in  each  phase.  At  the  end  of  the  first  phase,  the  frequent  itemsets  from  all 
the  partitions  are  merged  together  to  generate  the  set  of  all  potentially  frequent  itemsets. 
This  set  is  a  superset  of  all  the  frequent  itemsets  since  all  itemsets  that  are  frequent  in 
the  whole  database  have  to  be  frequent  at  least  in  one  partition.  In  the  second  phase,  the 
actual  support  of  these  itemsets  are  counted  and  the  frequent  itemsets  are  identified. 

Toivonen  proposes  a  sampling  based  algorithm  [127].  The  idea  there  is  to  pick  a 
random  sample,  use  it  to  determine  all  association  rules  that  probably  hold  in  the  whole 
database,  and  then  to  verify  the  results  with  the  rest  of  the  database.  The  algorithm  thus 
produces  exact  association  rules  in  one  full  pass  over  the  database.  In  those  rare  cases  where 
the  sampling  method  does  not  produce  all  association  rules,  the  missing  rules  can  be  found 
in  a  second  pass.  A  superset  of  the  frequent  itemsets  can  be  determined  efficiently  from  a 
random  sample  by  applying  any  level-wise  algorithm  on  the  sample  in  main  memory,  and 
by  using  a  lowered  support  threshold.  In  cases  where  approximate  results  are  sufficient,  the 
sampling  approach  can  significantly  reduce  the  computational  and  I/O  requirements  since 
it  works  on  a  much  smaller  dataset. 

A  different  way  of  counting  support  is  proposed  in  Savasere  et  al.  [Ill]  and  Zaki 
et  al.  [131].  Associated  with  each  item  is  a  tidlist  which  consists  of  all  the  transaction 
identifiers  that  contain  that  item.  The  support  for  an  itemset  can  be  obtained  by  counting 
the  number  of  transactions  that  contain  all  the  items  in  the  itemset.    If  the  tidlists  are 


26 


kept  sorted,  this  operation  can  be  done  by  performing  a  merge-scan  of  the  tidlists  of  all  the 
items  in  the  itemset. 

3.3     Input-Output  Formats 

Input  format  The  transaction  table  T  normally  has  two  column  attributes:  transac- 
tion identifier  (tid)  and  item  identifier  (item).  For  a  given  tid,  typically  there  are  multiple 
rows  in  the  transaction  table  corresponding  to  different  items  that  belong  to  the  same  trans- 
action. The  number  of  items  per  transaction  is  variable  and  unknown  during  table  creation 
time.  Thus,  alternative  schemas  may  not  be  convenient.  In  particular,  assuming  that  all 
items  in  a  transaction  appear  as  different  columns  of  a  single  tuple  [105]  is  not  practical, 
because  often  the  number  of  items  per  transaction  can  be  more  than  the  maximum  number 
of  columns  that  the  database  supports.  For  instance,  for  one  of  the  real-life  datasets  we 
experimented  with,  the  maximum  number  of  items  per  transaction  is  872  and  for  another 
it  is  700.  In  contrast,  the  corresponding  average  number  of  items  per  transaction  is  only 
9.6  and  4.4  respectively.  Even  if  the  database  supports  so  many  columns  for  a  table,  there 
will  be  lot  of  space  wastage  in  that  scheme. 

Output  format  The  output  is  a  collection  of  rules  of  varying  length.  The  maximum 
length  of  these  rules  is  much  smaller  than  the  total  number  of  items  and  is  rarely  more 
than  a  dozen.  Therefore,  a  rule  is  represented  as  a  tuple  in  a  fixed-width  table  where  the 
extra  column  values  are  set  to  NULL  to  accommodate  rules  involving  smaller  itemsets.  The 
schema  of  a  rule  is  (itemu  . . . ,  itemk,  len,  rulem,  confidence,  support)  where  k  is  the  size  of 
the  largest  frequent  itemset.  The  len  attribute  gives  the  length  of  the  rule  (number  of  items 
in  the  rule)  and  the  rulem  attribute  gives  the  position  of  the  -►  in  the  rule.  For  instance, 
if  k  =  5,  the  rule  AB^CD  which  has  90%  confidence  and  30%  support  is  represented  by 
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the  tuple  (A,  B,  C,  D,  NULL,  4,  2,  0.9,0.3).  The  frequent  itemsets  are  represented  the 
same  way  as  rules  but  do  not  have  the  rulem  and  confidence  attributes. 

3.4     Associations  in  SQL 

In  Section  3.4.1  we  present  the  candidate  generation  process  in  SQL.  In  Section  3.4.2 
we  present  the  support  counting  process  and  in  Section  3.4.3  we  present  the  rule  generation 
process. 

3.4.1     Candidate  Generation  in  SQL 

Recall  that  the  apriori  algorithm  for  finding  frequent  itemsets  proceeds  in  a  level-wise 
manner.  In  each  pass  k  of  the  algorithm  we  first  need  to  generate  a  set  of  candidate  itemsets 
Ck  from  frequent  itemsets  F^-i  of  the  previous  pass. 

Given  Fk-\,  the  set  of  all  frequent  (k  —  l)-itemsets,  the  Apriori  candidate  generation 
procedure  [9]  returns  a  superset  of  the  set  of  all  frequent  A;-itemsets.  We  assume  that  the 
items  in  an  itemset  are  lexicographically  ordered.  Since,  all  subsets  of  a  frequent  itemset 
are  also  frequent,  we  can  generate  Ck  from  F^-i  as  follows. 

First,  in  the  join  step,  we  generate  a  superset  of  the  candidate  itemsets  Ck  by  joining 
Fk-i  with  itself  as  shown  below. 

insert  into  Ck  select  I\.item\,  ...,  I\.itemk-\,l2.itemk-i 

from  Fk-\  I\,Fk^  I2 

where         I\.iterri\  =  I-2-item\  and 

I\.itemk^2  —  12-itemk-i  and 
I\.itemi;-\  <  12-itemk-i 
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For  example,  let  F3  be  {{1  2  3},  {1  2  4},  {1  3  4},  {1  3  5},  {2  3  4}}.  After  the  join  step,  C4 
will  be  {{1  2  3  4},  {1  3  4  5}}. 

Next,  in  the  prune  step,  all  itemsets  c  G  Ck,  where  some  (k  -  l)-subset  of  c  is  not 
in  Fk-i,  are  deleted.  Continuing  with  the  example  above,  the  prune  step  will  delete  the 
itemset  {1  3  4  5}  because  the  subset  {1  4  5}  is  not  in  F3.  We  will  then  be  left  with  only 
{1  2  3  4}  in  C4.  We  can  perform  the  prune  step  in  the  same  SQL  statement  as  the  join 
step  above  by  writing  it  as  a  A;-way  join  as  shown  in  Figure  3.2.  A  it-way  join  is  used 
since  for  any  /c-itemset  there  are  k  subsets  of  length  (k  -  1)  for  which  we  need  to  check  in 
Fk-i  for  membership.  The  join  predicates  on  Ix  and  I2  remain  the  same.  After  the  join 
between  Ix  and  I2  we  get  a  /c-itemset  consisting  of  (Ii.itemi, . . . ,  Ii.itemk-i,  I2.iterrik-i) 
as  shown  above.  For  this  itemset,  two  of  its  (k  -  l)-length  subsets  are  already  known  to  be 
frequent  since  it  was  generated  from  two  itemsets  in  Fk-\.  We  check  the  remaining  k  -  2 
subsets  using  additional  joins.  The  predicates  for  these  joins  are  enumerated  by  skipping 
one  item  at  a  time  from  the  fc-itemset  as  follows.  We  first  skip  itemi  and  check  if  the 
subset  (Ii.item2, . . . ,  Ii.itemk-i,I2-itemk-i)  belongs  to  Fk-\  as  shown  by  the  join  with  J3 
in  Figure  3.2.  In  general,  for  a  join  with  Ir  (3  <  r  <  k),  we  skip  item  r  -  2  which  gives  us 
join  predicates  of  the  form 

I\.item\  =  Ir.iteml  and 

Ii.itemr_3  =  Ir.itemr-3  and 
I\.itemr^i  =  Ir.itemT_2  and 

Il.iterrik-i  =  Ir.itemk-2  and 
I2.itemk-\  =  Ir-itemk-i- 


'2') 


Figure  3.3  gives  an  example  for  k  =  4. 

We  construct  a  primary  index  on  (itemi, . . .  ,  i£em;t-i)  of  Fk-\  to  efficiently  process 
these  k-w&y  joins  using  index  probes.  Note  that  sometimes  it  may  not  be  necessary  to  ma- 
terialize Ck  before  the  counting  phase.  Instead,  the  candidate  generation  can  be  pipelined 
with  the  subsequent  SQL  queries  used  for  support  counting. 

(Skip  item_k-2) 
Il.iteml        =  Ik. iteml 

Il.item_.k-1  =  Ik.item_k-2 

I2.item_k-1  =Ik.item_k-l  D^C] 


(Skip  item  1) 
Il.item2        =13. item  1 

Il.item_k-1  =  I3.item_k-2 
I2.item_k-1  =  I3.item_k-1 


F_k-1    Ik 


0<] 


Il.iteml        =  12. iteml 

Il.item_k-2  =  I2.item_k-2 
U.item_k-1  <I2.item_k-l 


F_k-1    13 


XI 


F  k-1    II 


F_k-1    12 


Figure  3.2.  Candidate  generation  for  any  A; 


(Skip  item2) 
Il.iteml  =  I4.iteml 
Il.item3  =  I4.item2 
I2.item3  =  I4.item3 


x 


(Skip  iteml) 
Il.item2  =  I3.iteml 
Il.item3  =  I3.item2 
I2.item3  =  I3.item3 


F3     14 


ex 


Il.iteml  =  12. iteml 
Il.item2  =  I2.item2 
Il.item3  <  I2.item3 


F3     13 


X 


F3     II  F3     12 

Figure  3.3.  Candidate  generation  for  k  —  4 
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3.4.2  Counting  Support  to  Find  Frequent  Itemsets 

This  is  the  most  time-consuming  part  of  the  association  rule  mining  algorithm.  We 
use  the  candidate  itemsets  Ck  and  the  data  table  T  to  count  the  support  of  the  itemsets  in 
Cfc.  We  consider  two  different  categories  of  SQL  implementations. 

(A)  The  first  one  is  based  purely  on  SQL-92.  We  discuss  four  approaches  in  this  category 
in  Section  3.5. 

(B)  The  second  utilizes  the  new  SQL  object-relational  extensions  like  UDFs,  BLOBs 
(binary  large  objects),  table  functions  and  so  on.  Table  functions  [80]  are  virtual 
tables  associated  with  a  user  defined  function  which  generate  tuples  on  the  fly.  Like 
normal  physical  tables  they  have  pre-defined  schemas.  The  function  associated  with 
a  table  function  can  be  implemented  as  any  other  UDF.  Thus,  table  functions  can  be 
viewed  as  UDFs  that  return  a  collection  of  tuples  instead  of  scalar  values. 

We  discuss  six  approaches  in  this  category  in  Chapter  4.  Note  that,  UDFs  in  this 
approach  are  not  heavy  weight  and  do  not  require  extensive  memory  allocations  and 
coding  unlike  in  a  purely  UDF-based  implementation  [12]. 

3.4.3  Rule  Generation 

In  the  second  phase  of  the  association  rule  mining  algorithm,  we  use  the  frequent 
itemsets  to  generate  rules  with  the  user  specified  minimum  confidence,  minconf.  For  every 
frequent  itemset  /,  we  first  find  all  non-empty  proper  subsets  of  I.  Then,  for  each  of  those 
subsets  m,  we  find  the  confidence  of  the  rule  m— ►(/  -  m)  and  output  the  rule  if  it  is  at  least 
minconf. 

In  the  support  counting  phase,  the  frequent  itemsets  of  size  k  are  stored  in  table  Fk. 
Before  the  rule  generation  phase,  we  merge  all  the  frequent  itemsets  into  a  single  table  F. 


■M 


The  schema  of  F  consists  of  k  +  2  attributes  {item\, . . . ,  iterrik,  support,  len),  where  k  is 
the  size  of  the  largest  frequent  itemset  and  len  is  the  length  of  the  itemset  as  discussed 
earlier  in  Section  3.3. 

We  use  the  table  function  GenRules  to  generate  all  possible  rules  from  a  frequent  item- 
set.  The  input  argument  to  the  function  is  a  frequent  itemset.  For  each  itemset,  it  outputs 
tuples  corresponding  to  rules  with  all  non-empty  proper  subsets  of  the  itemset  in  the  con- 
sequent. The  table  function  outputs  tuples  with  k  +  3  attributes,  TJtemi,. . .  ,T  Atem^, 
T support,  Tien,  Tjrulem.  The  output  is  joined  with  F  to  find  the  support  of  the  an- 
tecedent and  the  confidence  of  the  rule  is  calculated  by  taking  the  ratio  of  the  support 
values.  The  predicates  in  the  where  clause  match  the  antecedent  of  the  rule  with  the  fre- 
quent itemset  corresponding  to  the  antecedent.  While  checking  for  this  match,  we  need  to 
check  only  up  to  itemk  where  k  <  Tjrulem.  The  or  part  (t\.Tjrulem  <  k)  in  the  predicate 
accomplishes  this.  Figure  3.4  illustrates  the  rule  generation  query. 

We  can  also  do  rule  generation  without  using  table  functions  and  base  it  purely  on 
SQL-92.  The  rules  are  generated  in  a  level-wise  manner  where  in  each  level  k  we  generate 
rules  with  consequents  of  size  k.  Further,  we  make  use  of  the  property  that  for  any  frequent 
itemset,  if  a  rule  with  consequent  c  holds,  then,  so  do  rules  with  consequents  that  are  subsets 
of  c  as  suggested  in  Agrawal  et  al.  [9].  We  can  use  this  property  to  generate  rules  in  level 
k  using  rules  with  (k  -  1)  long  consequents  found  in  the  previous  level,  much  like  the  way 
we  did  candidate  generation  in  Section  3.4.1. 

The  fraction  of  the  total  running  time  spent  in  rule  generation  is  very  small.  Therefore:, 
we  do  not  focus  much  on  rule  generation  algorithms. 
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insert  into  R  select  TJterrti,  . .  .TJtemk, 

t\ .support.  TJen,  T_rulem,  t\. support/^. support, 
from  F  /i,  table(GenRules(/i.iiemi,. . .  ,f\.iterriki  /l-len,  fa. support))  as  t\,  F  fa 

where  {t\.TJtem\  =  fa.item\  or  t\.Tjrulem  <  1)  and 

{t\.TJtemk  —  fa-iterrik  or  ti.Tjrulem  <  k)  and 

t\.T-rulem  =  fa.len  and 

£i.T_supj9or/,//2. support  >  :minconf 

item  1  ,...itemk,  len,  rulem, 
confidence,  support 

t 
conf  >  rminconf 

t 


Table  function 

GenRules 

t 


Figure  3.4.  Rule  generation 
3.5     Support  Counting  Using  SQL-92 
We  present  four  approaches  in  this  category. 
3.5.1     K-way  Join 

In  each  pass  k,  we  join  the  candidate  itemsets  C*  with  k  transaction  tables  T  and 
follow  it  up  with  a  group  by  on  the  itemsets  as  shown  in  Figure  3.5. 

Figure  3.5  also  shows  a  tree  diagram  of  the  query.  These  tree  diagrams  are  not  to  be 
confused  with  the  plan  trees  which  could  look  quite  different. 
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insert  into  F^  select  item\,  . .  .item^,  count(*) 
from  Ck)Th,...Ttk 

where         fi.item  =  Ck-item\  and 

ifc.item  =  Ck.itemk  and 
ii.tid  =  ^tid  and 


ijt_i.tid  =  tk-tld 
group  by    item\,item2  .  ■  ■  item^ 
having        count(*)  >  :minsup 


having 
count(*)  >  :minsup 


Group  by 
itemL....,itcmk 


Ck.iieml  =  t  Litem 


Ck.itemk  =  tk.item 


1 1  .tid  =  tk.tid     £><C] 


[><] 


II. lid  m  |2  lid 


X 


Figure  3.5.  Support  counting  by  K-way  join 

This  SQL  computation,  when  merged  with  the  candidate  generation  step,  is  similar  to 
the  one  proposed  in  Tsur  et  al.  [128]  as  a  possible  mechanism  to  implement  query  flocks. 
In  Section  3.7,  we  discuss  the  different  execution  plans  for  this  query  and  the  related 
performance  issues. 

3.5.2     Three-way  Join 

The  above  approach  requires  (k  +  l)-way  joins  in  the  Arth  pass.  We  can  reduce  the 
cardinality  of  joins  to  3  using  the  following  approach  which  bears  some  resemblance  to 
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the  AprioriTid  algorithm  in  Agrawal  et  al.  [9].  Each  candidate  itemset  Ck,  in  addition 
to  attributes  (itemi, . . .  ,itemk)  has  three  new  attributes  (aid,  id\ , id2) .  aid  is  a  unique 
identifier  associated  with  each  itemset  and  id\  and  id2  are  aids  of  the  two  itemsets  in  Fk_\ 
from  which  the  itemset  in  Ck  was  generated  (as  discussed  in  Section  3.4.1).  In  addition, 
in  the  kth  pass  we  generate  a  new  copy  of  the  data  table  Tjt  with  attributes  (tid,  old)  that 
keeps  for  each  tid  the  oid  of  each  itemset  in  Ck  that  it  supported.  For  support  counting, 
we  first  generate  Tk  from  Tk-\  and  Ck  and  then  do  a  group-by  on  Tk  to  find  Fk  as  follows. 

insert  into  Tk  select  <i .tid,  oid 
from  Cfc,Tjfc_i  t\,Tk-\  t2 

where         ti.oid  =  Ck-id\  and  <2-oid  =  Ck-id2  and  ii .tid  =  t2.ti<i 

insert  into  Fk  select  oid,  iterny,  . .  .iterrik,  cnt 
from  Ck, 

(select  oid  as  cid,  count(*)  as  cnt  from  T^ 

group  by  oid  having  count(*)  >  :minsup)  as  temp 
where  Cfc.oid  =  cid 

3.5.3     Two  Group-by 

Another  way  to  avoid  multi-way  joins  is  to  first  join  T  and  Ck  based  on  whether  the 
"item"  of  a  (tid,  item)  pair  of  T  is  equal  to  any  of  the  k  items  of  Ck,  then  do  a  group  by  on 
(itemi, . . .  ,item,k,tid)  filtering  tuples  with  count  equal  to  k.  This  gives  all  (itemset,  tid) 
pairs  such  that  the  tid  supports  the  itemset.  Finally,  as  in  the  previous  approaches,  do  a 
group-by  on  the  itemset  [itemi,. . .  ,itemk)  filtering  tuples  that  meet  the  support  condition. 

insert  into  Fk  select  itemi,. . .  itemk,  count(*) 


:w> 


from  (select  iterrii, . . .  iterrik,  count(*) 

from     T,  Ck 
where   item  =  Ck-itemi  or 

item  =  Ck-iterrik 

group  by  itemi, . . .  ,item,k,  tid 

having  count(*)  =  k)  as  temp 
group  by    item\, . . . ,  iterrik 
having        count(*)  >  :minsup 

3.5.4     Subquerv-Based 

This  approach  makes  use  of  common  prefixes  between  the  itemsets  in  Ck  to  reduce  the 
amount  of  work  done  during  support  counting.  We  break  up  the  support  counting  phase 
into  a  cascade  of  k  subqueries.  The  /-th  subquery  Qi  finds  all  tids  that  match  the  distinct 
itemsets  formed  by  the  first  /  columns  of  Ck  (call  it  di).  The  output  of  Qi  is  joined  with  T 
and  di+i  (the  distinct  itemsets  formed  by  the  first  I  +  1  columns  of  Ck)  to  get  Qi+\.  The 
final  output  is  obtained  by  doing  a  group-by  on  the  k  items  to  count  support  as  above.  The 
queries  and  the  tree  diagram  for  subquery  Qi  are  given  in  Figure  3.6. 

3.6     Performance  Comparison 

In  this  section  we  compare  the  performance  of  the  four  SQL-92  approaches.  All  of 
our  experiments  were  performed  on  Version  5  of  IBM  DB2  Universal  Server  installed  on  a 
RS/6000  Model  140  with  a  200  MHz  CPU,  256  MB  main  memory  and  a  9  GB  disc  with  a 
measured  transfer  rate  of  8  MB/s.  These  experimental  results  are  also  reported  in  Sarawagi 
et  al.  [109]. 


:j<i 


insert  into  Fk  select  itemi, . . .  ,itemk,  count(*) 
from  (Subquery  Q^)  t 
group  by  item\,item.2  . .  .item^ 
having  count(*)  >  :minsup 

Subquery  Qi  (for  any  /  between  1  and  k): 
select  itemi, . . .  itemi,  tid 
from  T  ti,  (Subquery  Qi~\)  as  r;_i, 

(select  distinct  itemi  ■  ■  ■  itemi  from  Ck)  as  di 
where  r(_i.i£emi  =  di.itemi  and  . . .  and 
ri^\.itemi-\  =  d[.itemi_iand 
ri-i.tid  —  ti.tid  and 
ti.item  =  d[. itemi 

Subquery  Qq:   No  subquery  Qq- 


Subquery  Q_l 

t 
itemi,. ...itenil.  tid 


tl.item  =  di.itemi 

r 1- 1 .  itemi        =  di.itemi 

r_l-l.item_l-l  =dl.item_l-l     P><C] 


X 


Ttl 


select  distinct 
itemi,.  .., itemi 

t 
Ck 


Tree  diagram  for  Subquery  Qi 


Figure  3.6.  Support  counting  using  subqueries 


Table  3.1.  Description  of  different  real-life  datasets 


Datasets 

#  Records 

#  Transactions 

#  Items 

Avg.#items 

in  millions 

in  millions 

in  thousands 

Dataset-A 

2.5 

0.57 

85 

4.4 

Dataset-B 

7.5 

2.5 

15.8 

2.62 

Dataset-C 

6.6 

0.21 

15.8 

31 

Dataset-D 

14 

1.44 

480 

9.62 
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We  selected  a  collection  of  four  real-life  datasets  obtained  from  various  mail-order 
companies  and  retail  stores  for  the  experiments.  These  datasets  have  differing  values  of 
parameters  like  the  total  number  of  (tid,item)  pairs,  the  number  of  transactions  (tids),  the 
number  of  items  and  the  average  number  of  items  per  transaction.  Table  3.1  summarizes 
these  parameters. 

In  this  dissertation,  we  report  the  performance  with  only  Dataset-A.  The  overall 
observation  was  that  mining  implementations  in  pure  SQL-92  are  too  slow  to  be  practical. 
For  these  experiments  we  built  a  composite  index  {item\ . . .  item^)  on  Cfc,  k  different  indices 
on  each  of  the  k  items  of  Ck  and  a  (tid,item)  and  a  (item,tid)  index  on  the  data  table. 
The  goal  was  to  let  the  optimizer  choose  the  best  plan  possible.  We  do  not  include  the 
index  building  cost  in  the  total  time. 

In  Figure  3.7  we  show  the  total  time  taken  by  the  four  approaches:  KwayJoin,  3wayJoin, 
Subquery  and  2GroupBy.  For  comparison,  we  also  show  the  time  taken  by  the  Loose-coupling 
approach  because  this  is  the  approach  currently  used  by  existing  systems.  The  graph  shows 
the  total  time  split  into  candidate  generation  time  (Cgen)  and  the  time  for  each  pass.  The 
candidate  generation  time  and  the  time  for  the  first  pass  are  much  smaller  compared  to  the 
total  time.  From  these  set  of  experiments  we  can  make  the  following  observations. 
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Figure  3.7.  Comparison  of  different  SQL-92  approaches 

•  The  best  approach  in  the  SQL-92  category  is  the  Subquery  approach.  An  important 
reason  for  its  performing  better  than  the  other  approaches  is  exploitation  of  common 
prefixes  between  candidate  itemsets.  None  of  the  other  three  approaches  uses  this 
optimization.  Although  the  Subquery  approach  is  comparable  to  the  Loose-coupling 
approach  in  some  cases,  for  other  cases  it  did  not  complete  even  after  taking  ten  times 
more  time  than  the  Loose-coupling  approach. 

•  The  2GroupBy  approach  is  significantly  worse  than  the  other  three  approaches  because 
it  involves  an  index-ORing  operation  on  k  indices  for  each  pass  k  of  the  algorithm. 
In  addition,  the  inner  group-by  requires  sorting  a  large  intermediate  relation.  The 
outer  group-by  is  comparatively  faster  because  the  sorted  result  is  of  size  at  most  Ck 
which  is  much  smaller  than  the  result  size  of  the  inner  group-by.  The  DBMS  does 
aggregation  during  sorting  therefore  the  size  of  the  result  is  an  important  factor  in 
the  total  cost. 
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•  The  3wayJoin  approach  is  comparable  to  the  KwayJoin  approach  for  this  dataset 
because  the  number  of  passes  is  at  most  three.  As  shown  in  Agrawal  et  al.  [9]  there 
might  be  other  datasets  especially  ones  where  there  is  significant  reduction  in  the  size 
of  Tjt  as  k  increases  where  3wayJoin  might  perform  better  than  KwayJoin.  However, 
one  disadvantage  of  the  3wayJoin  approach  is  that  it  requires  space  to  store  and  log 
the  temporary  relations  7*  generated  in  each  pass. 

3.7     Cost  Analysis 

In  this  section,  we  analyze  the  cost  of  the  KwayJoin  and  Subquery  approaches  which 
represent  the  better  ones  among  the  SQL-92  approaches.  A  relational  query  engine  can 
execute  the  KwayJoin  query  in  several  different  ways  and  the  performance  depends  on  the 
execution  plan  chosen.  We  experimented  with  a  number  of  alternative  execution  plans 
for  this  query.  We  could  force  the  query  processor  to  choose  different  plans  by  creating 
different  indices  on  T  and  Ck,  and  in  some  cases  by  disabling  certain  join  methods  in  our 
experiments  using  Postgres.  We  will  elaborate  more  on  this  in  Section  3.9. 

We  present  two  different  execution  plans — one  with  Ck  as  the  outermost  relation  in 
Section  3.7.1  and  another  with  Ck  as  the  innermost  relation  in  Section  3.7.2 — and  the  cost 
analysis  for  them.  The  effect  of  subquery  optimization  on  the  cost  estimates  is  outlined  in 
Section  3.7.3. 

A  schematic  diagram  of  the  two  different  execution  plans  are  given  in  Figures  3.8  and 
3.9.  In  the  cost  analysis,  we  use  the  mining-specific  data  parameters  and  knowledge  about 
association  rule  mining  (Apriori  algorithm  [9]  in  this  case)  to  estimate  the  cost  of  joins  and 
the  size  of  join  results.  Even  though  current  relational  optimizers  do  not  use  this  mining- 
specific  semantic  information,  the  analysis  provides  a  basis  for  developing  "mining-aware" 
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Figure  3.8.  K-way  join  plan  with  Ck  as  inner  relation 
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t2.lid  =tl.tid         Ur^tJ 


Ck.iteml  =  tl.item    L^><CJ 


Figure  3.9.  K-way  join  plan  with  Ck  as  outer  relation 

optimizers.  The  cost  formulae  are  presented  in  terms  of  operator  costs  in  order  to  make 
them  general;  for  instance  join(p,  q,  r)  denotes  the  cost  of  joining  two  relations  of  size  p 
and  q  to  get  a  result  of  size  r.  The  actual  cost  which  is  based  on  the  join  method  used 
and  the  system  parameters  can  be  derived  by  substituting  the  appropriate  values  in  the 
given  formulae.  The  data  parameters  and  operators  used  in  the  analysis  are  summarized 
in  Table  3.2. 

3-7.1 KWavJoin  Plan  with  CV  as  Outer  Relation 

Start  with  Ck  as  the  outermost  relation  and  perform  a  series  of  joins  with  the  k  copies 
of  T.  The  final  join  result  is  grouped  on  the  k  items  to  find  the  support  counts.  The 
choice  of  join  methods  for  each  of  the  intermediate  joins  depends  on  factors  such  as  the 
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Table  3.2.  Notations  used  in  cost  analysis 


R 

T 
A" 
Fi 
S(C) 

Rf 

Nf 
Ck 
C{n,k) 

tk 

group(n,  m) 
Join(p,  q,  r) 
blob(n) 


number  of  records  in  the  input  transaction  table 

number  of  transactions 

average  number  of  items  per  transaction  =  ^ 

number  of  frequent  items 

sum  of  support  of  each  itemset  in  set  C 

average  support  of  a  frequent  fc-itemset  =  sSpk) 

number  of  records  out  of  R  involving  frequent  items  =  S{F\ ) 

average  number  of  frequent  items  per  transaction  =  -^- 

number  of  candidate  k-itemsets 

number  of  combinations  of  size  k  possible  out  of  a  set  of  size  n:  =  , !(.  n[,v 

cost  of  generating  a  k  item  combination  using  table  function  Comb-k 

cost  of  grouping  n  records  out  of  which  m  are  distinct 

cost  of  joining  two  relations  of  size  p  and  q  to  get  a  result  of  size  r 

cost  of  passing  a  BLOB  of  size  n  integers  as  an  argument 


availability  of  indices,  the  size  of  intermediate  results,  and  the  amount  of  available  memory. 
For  instance,  the  efficient  execution  of  nested  loops  joins  require  an  index  {item,  tid)  on 
T.  If  the  intermediate  join  result  is  large,  it  could  be  advantageous  to  materialize  it  and 
perform  sort-merge  join. 

For  each  candidate  itemset  in  Ck,  the  join  with  T  produces  as  many  records  as  the 
support  of  its  first  item.  Therefore,  the  size  of  the  join  result  can  be  estimated  to  be  the 
product  of  the  number  of  ^-candidates  and  the  average  support  of  a  frequent  item  and 
hence  the  cost  of  this  join  is  given  by  join^,  R,  C*  *  si).  Similarly,  the  relation  obtained 
after  joining  Ck  with  I  copies  of  T  contain  as  many  records  as  the  sum  of  the  support 
counts  of  the  /-item  prefixes  of  the  candidate  fc-itemsets.  Hence  the  cost  of  the  Ith  join 
is  join(Cjt  *  st_i,  R,  Ck  *  sj)  where  s0  =  1.  Note  that  values  of  the  Sj's  can  be  computed 
from  statistics  collected  in  the  previous  passes.  Cost  of  the  last  join  (with  Tk)  cannot  be 
estimated  using  the  above  formula  since  the  A;-item  prefix  of  a  ^-candidate  is  not  frequent 
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and  we  do  not  know  the  value  of  s*.  However,  the  last  join  produces  S(Ck)  records- 
there  will  be  as  many  records  for  each  candidate  as  its  support — and  therefore,  the  cost  is 
}o'm(Ck  *  Sfc-i,  R,  S{Ck))-  S(Ck)  can  be  estimated  by  adding  the  support  estimates  of  all 
the  itemsets  in  Ck-  A  good  estimate  for  the  support  of  a  candidate  itemset  is  the  minimum 
of  the  support  counts  of  all  its  subsets.  The  overall  cost  of  this  plan  expressed  in  terms  of 
operator  costs  is 


fc-i 

{^join(Cfc  *  s,_!,  R,  Ck  *  st)}  +join(Cfc  *  sk-i,  R,  S{Ck))  +  gionp{S(Ck),  Ck) 
1=1 


3.7.2     KWavJoin  Plan  with  CV  as  Inner  Relation 

In  this  plan,  we  join  the  k  copies  of  T  and  the  resulting  fc-item  combinations  are  joined 
with  Ck  to  filter  out  non-candidate  item  combinations.  The  final  join  result  is  grouped  on 
the  &-items. 

The  result  of  joining  /  copies  of  T  is  the  set  of  all  possible  /-item  combinations  of 
transactions  and  there  are  C(N,  I)  *T  such  combinations.  We  know  that  the  items  in  the 
candidate  itemset  are  lexicographically  ordered  and  hence  we  can  add  extra  join  predicates 
as  shown  in  Figure  3.8  to  limit  the  join  result  to  /-item  combinations  (without  these  extra 
predicates  the  join  will  result  in  /-item  permutations).  When  Ck  is  the  outermost  relation 
these  predicates  are  not  required.  A  mining-aware  optimizer  should  be  able  to  rewrite  the 
query  appropriately.  The  Ith  join  produces  (/  +  l)-item  combinations  and  therefore,  its  cost 
is  join(C(N,l)  *  T,  R,  C{N,l  +  1)  *  T).  The  last  join  produces  S{Ck)  records  as  in  the 
previous  case  and  hence  its  cost  is  }om(C{N,  k)  *  T,  Ck,  S{Ck)).   The  overall  cost  of  this 
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plan  is 


fc    1 
{]Tjoin(C(iV,Z)*T,  R,  C(N,l  +  l)*T)}+jom(C(N,k)*T,  Ck,  S(Ck))  +  gvoup(S(Ck),  Ck) 


Note  that  in  the  above  expression  C(N,  1)  *  T  =  R. 

3.7.3     Effect  of  Subquery  Optimization 

The  subquery  optimization  makes  use  of  common  prefixes  among  candidate  itemsets. 
Unfolding  all  the  subqueries  will  result  in  a  query  tree  which  structurally  resembles  the 
KwayJoin  plan  tree  shown  in  Figure  3.9.  Subquery  Qi  produces  dlk  *  si  records  where  dk 
denotes  the  number  of  distinct  j  item  prefixes  of  itemsets  in  Ck.  In  contrast,  the  Ith  join  in 
the  KwayJoin  plan  results  in  Typically  dlk  is  much  smaller  compared  to  Ck  which  explains 
why  the  Subquery  approach  performs  better  than  the  KwayJoin  approach.  Ck  *  S(  records. 
The  output  of  subquery  Qk  contains  S(Ck)  records.  The  total  cost  of  this  approach  can  be 
estimated  to  be 


{X  trijoin(i?,  s,_i  *  d[  \  4,  si  *  4)}  +  group(S(C*),  Ck) 


where  tr[jo'm(p,q,r,s)  denotes  the  cost  of  joining  three  relations  of  size  p,q,r  respectively 
producing  a  result  of  size  s.  The  value  of  sk  which  is  the  average  support  of  a  frequent 
A;-itemset  can  be  estimated  as  mentioned  in  section  3.7.1. 

The  experimental  results  presented  in  Section  3.6  and  in  Sarawagi  et  al.  [110]  shows 
that  the  subquery  optimization  gave  much  better  performance  than  the  basic  KwayJoin 
approach  (an  order  of  magnitude  better  in  some  cases).  We  observed  the  same  trend  in  our 
additional  experiments  using  synthetic  datasets.  We  used  synthetic  datasets  for  some  of  the 
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experiments  because  the  real-life  datasets  were  not  available  outside  IBM.  The  synthetic 
datasets  used  in  our  experiments  are  detailed  below.  The  main  reason  for  the  subquery 
approach  performing  better  is  that  the  number  of  distinct  /-item  prefixes  is  much  less 
compared  to  the  total  number  of  candidate  itemsets  which  results  in  joins  between  smaller 
tables.  The  number  of  candidate  itemsets  and  the  corresponding  distinct  item  prefixes  for 
various  passes  in  one  of  our  experiments  is  given  in  Figure  3.10.  These  numbers  are  for  the 
dataset  T10.I4.D100K  and  0.33%  support.  Note  that  d£  is  not  shown  since  it  is  the  same  as 
Cfc.  In  pass  3,  Cz  contains  2453  itemsets  where  as  d\  has  only  295  1-item  prefixes  (almost 
a  factor  of  10  less  than  C3).  This  results  in  correspondingly  smaller  intermediate  tables  as 
shown  in  the  analysis  above,  which  is  the  key  to  the  performance  gain. 


iiiiiiiiiiiiiiiiininiMTi 


C3d3'd3a  C.  d*' — d„*  Cgda' dB* 


Figure  3.10.  Number  of  candidate  itemsets  vs  distinct  item  prefixes 


Experimental  datasets  We  used  synthetic  data  generated  according  to  the  procedure 
explained  in  Agrawal  and  Srikant  [13]  for  some  of  the  experiments.  The  results  reported 
here  are  for  the  datasets  T5.I2.D100K  and  T10.I4.D100K.  The  first  dataset  consists  of  100 
thousand  transactions,  each  containing  an  average  of  5  items.    The  average  size  of  the 
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Table  3.3.  Description  of  synthetic  datasets 


Datasets 

#  Records 

#  Transactions 

#  Items 

Avg.#  items 

T5.I2.D100K 
T10.I4.D100K 

546651 
1154995 

100000 
100000 

1000 
1000 

5 
10 

maximal  potentially  frequent  itemsets  (denoted  as  I)  is  2.  The  transaction  table  corre- 
sponding to  this  dataset  had  approximately  550  thousand  records.  The  second  dataset 
has  100  thousand  transactions,  each  containing  an  average  of  10  items  (total  of  about  1.1 
million  records)  and  the  average  size  of  maximal  potentially  frequent  itemsets  is  4.  The 
different  parameters  of  the  two  datasets  are  summarized  in  Table  3.3. 

The  experiments  reported  in  the  rest  of  this  chapter  were  performed  on  PostgreSQL 
Version  6.3  [100],  a  public  domain  DBMS,  installed  on  a  8  processor  Sun  Ultra  Enterprise 
4000/5000  with  248  MHz  CPUs  and  256  MB  main  memory  per  processor,  running  Solaris 
2.6.  Note  that  PostgreSQL  is  not  parallelized.  It  supports  nested  loops,  hash-based  and 
sort-merge  join  methods  and  provides  finer  control  of  the  optimizer  to  disable  any  of  the 
join  methods.  We  have  found  it  to  be  a  useful  platform  for  studying  the  performance  of 
different  join  methods  and  execution  plans. 

3.8     Performance  Optimizations 

The  cost  analysis  presented  above  provides  some  insight  into  the  different  components 
of  the  execution  time  in  the  different  passes  and  what  can  be  optimized  to  achieve  better 
performance.  In  this  section,  we  present  three  optimizations  to  the  KwayJoin  approach 
(other  than  the  subquery  optimization)  and  discuss  how  they  impact  the  cost.  Based  on 
these  optimizations,  we  develop  the  Set-oriented  Apriori  approach  in  Section  3.8.4. 
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3.8.1     Pruning  Non- Frequent  Items 

The  size  of  the  transaction  table  is  a  major  factor  in  the  cost  of  joins  involving  T. 
It  can  be  reduced  by  pruning  the  non-frequent  items  from  the  transactions  after  the  first 
pass.  We  store  the  transaction  data  as  (tid,  item)  tuples  in  a  relational  table  and  hence 
this  pruning  means  simply  dropping  the  tuples  corresponding  to  non-frequent  items.  This 
can  be  achieved  by  joining  T  with  the  frequent  items  table  Fi  as  follows. 

insert  into  Tj  select  t.tid,  t.item 
from  T  t,  Fi  f 

where  t.item  =  f.item 

We  insert  the  pruned  transactions  into  table  Tj  which  has  the  same  schema  as  that  of  T. 
In  the  subsequent  passes,  joins  with  T  can  be  replaced  with  corresponding  joins  with  T*. 
This  could  result  in  improved  performance  especially  for  higher  support  values  where  the 
frequent  item  selectivity  is  low,  since  we  join  smaller  tables.  For  some  of  the  synthetic 
datasets  we  used  in  our  experiments,  this  pruning  reduced  the  size  of  the  transaction  table 
to  about  half  its  original  size.  This  could  be  even  more  useful  for  real-life  datasets  which 
typically  contains  lots  of  non-frequent  items.  For  example,  some  of  the  real-life  datasets 
used  for  the  experiments  reported  in  Sarawagi  et  al.  [109]  contained  of  the  order  of  100 
thousand  items  out  of  which  only  a  few  hundreds  were  frequent.  Figure  3.11  shows  the 
reduction  in  transaction  table  size  due  to  this  optimization  for  our  experimental  datasets. 
The  initial  size  (R)  and  the  size  after  pruning  (Rf)  for  different  support  values  are  shown. 
With  this  optimization,  in  the  cost  formulae  of  section  3.7,  R  can  be  replaced  with 
Rf — the  number  of  records  in  T  involving  frequent  items  and  N  with  Nf — the  average 
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Figure  3.11.  Reduction  in  transaction  table  size  by  non- frequent  item  pruning 
number  of  frequent  items  per  transaction.   Note  that  with  the  subquery  optimization  we 
could  achieve  significant  performance  improvements  by  reducing  the  relation  sizes  of  joins. 
3-8.2 Eliminating  Candidate  Generation  in  Second  Pass 

This  optimization  aims  at  reducing  the  cost  of  the  second  pass  which  is  a  significant 
portion  of  the  total  cost  in  some  cases.  In  the  second  pass,  C2  is  almost  a  cartesian  product 
of  the  two  Fxs  used  to  generate  it  and  hence  materializing  it  and  joining  with  the  T's  (or 
Tf's)  could  be  expensive.  In  order  to  circumvent  the  problem  of  counting  the  large  C2, 
most  mining  algorithms  use  special  techniques  in  the  second  pass.  A  few  examples  are 
two-dimensional  arrays  in  IBM's  Quest  data  mining  system  [2]  and  hash  filters  proposed  in 
Park  et  al.  [99]  to  limit  the  size  of  C2.  The  generation  of  C2  can  be  completely  eliminated 
by  formulating  the  join  query  to  find  F2  as 

insert  into  F2  select  p.item,  q.item,  count(*) 
from  Tf  p,  Tf  q 

where         p.tid  =  q.tid  and  p.item  <  q.item 
group  by    p.item,  q.item 
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having        count(*)  >  rminsup 
The  cost  of  second  pass  with  this  optimization  is 

join(i?/;  Rf,  C(Nf,  2))+group(C(i\T/l  2),C(FU  2)) 

Even  though  the  grouping  cost  remains  the  same,  there  is  a  big  reduction  from  the  basic 
KwayJoin  approach  in  the  join  costs.  Figure  3.12  compares  the  running  time  of  the  sec- 
ond pass  with  this  optimization  to  the  basic  KwayJoin  approach  for  the  two  experimental 
datasets.  For  the  KwayJoin  approach,  the  best  execution  plan  was  the  one  which  generates 
all  2-item  combinations,  joins  them  with  the  candidate  set  and  groups  the  join  result.  We 
can  see  that  this  optimization  has  a  significant  impact  on  the  running  time. 

3.8.3  Reusing  the  Item  Combinations  from  Previous  Pass 

The  SQL  formulations  of  association  rule  mining  is  based  on  generating  item  combi- 
nations in  various  ways  and  similar  work  is  performed  in  all  the  different  passes.  Therefore, 
reusing  the  item  combinations  will  improve  the  performance  especially  in  the  higher  passes. 
We  will  explain  more  about  this  optimization  and  the  corresponding  cost  reduction  in  the 
next  section. 

3.8.4  Set-Oriented  Apriori 

In  this  section,  we  develop  a  set-oriented  version  of  the  apriori  algorithm  combining 
all  the  performance  optimizations  outlined  above,  which  we  call  Set-oriented  Apriori.  We 
store  the  item  combinations  generated  in  each  pass  and  use  them  in  the  subsequent  passes 
instead  of  generating  them  in  every  pass  from  the  transaction  tables.  In  the  kth  pass  of 
the  support  counting  phase,  we  generate  a  table  Tk  which  contains  all  A;-item  combinations 
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Figure  3.12.  Benefit  of  second  pass  optimization 

that  are  candidates.  Tk  has  the  schema  (tid,  itemi ,... ,  itemk).  We  join  Tfc_i,  Tj  and  Ck  as 
shown  below  to  generate  Tk.  A  tree  diagram  of  the  query  is  also  given  in  Figure  3.13.  The 
frequent  itemsets  Fk  is  obtained  by  grouping  the  tuples  of  Tk  on  the  k  items  and  applying 
the  minimum  support  filtering. 

We  can  further  prune  Tk  by  filtering  out  item  combinations  that  turned  out  to  be 
non-frequent.  However,  this  is  not  essential  since  we  join  it  with  the  candidate  set  Ck+\  in 
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insert  into  Tk 

select  p.tid,  p.itemi,  . .  .p.itemk-\,  q.item 

from  Ck,  Tfc_!  p,  Tf  q 

where         p.itemi  —  Ck-itemi  and 

p.iterrik-i  =  Ck-iterrik-i  and 
q.item    =  Ck-itemk  and 
p.tid       =  q.tid 


Tk 
p.itemi  -Ck.iteml 


t 

p.item_k-l  =  Ck.item_k-1       P]><rl 
q.item  =  Ckitemk 


p.tid  =  q.tid  Ck 

p.item_k-l  <  q.item 


IX 


T_k-1    p  Tl    q 

Figure  3.13.  Generation  of  Tk 

the  next  pass  to  generate  Tjt+1.  The  only  advantage  of  pruning  Tk  is  that  we  will  have  a 
smaller  table  to  join  in  the  next  pass;  but  at  the  additional  cost  of  joining  Tk  with  Fk. 

We  use  the  optimization  discussed  above  for  the  second  pass  and  hence  do  not  mate- 
rialize and  store  the  2-item  combinations  T2.  Therefore,  we  generate  T3  directly  by  joining 
Tf  with  C3  as 

insert  into  T3  select  p.tid,  p.item,  q.item,  r.item 
from  Tf  p,  Tf  q,  Tf  v,  Ck 

where         p.item  =  Cz-itemi  and  q.item  =  C3.item2  and  r.item  —  C^.items  and 
p.tid  =  q.tid  and  q.tid  =  x.tid 
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We  can  also  use  the  Subquery  approach  to  generate  T3  if  that  is  estimated  to  be  less 
expensive.  T3  will  contain  exactly  the  same  tuples  produced  by  subquery  Q3. 

The  Set-oriented  Apriori  algorithm  bears  some  resemblance  with  the  three-way  join 
approach  in  Section  3.5.2  and  the  AprioriTid  algorithm  in  Agrawal  and  Srikant  [13].  In  the 
three-way  join  approach,  the  temporary  table  Tk  stores  for  each  transaction,  the  identifiers 
of  the  candidates  it  supported.  Tk  is  generated  by  joining  two  copies  of  Tjt_i  with  Ck.  The 
generation  of  Fk  requires  a  further  join  of  Tk  with  Ck.  AprioriTid  makes  use  of  special 
data  structures  which  are  difficult  to  maintain  in  the  SQL  formulation. 

Cost  Comparison 

In  Section  3.7.3,  we  saw  that  the  cost  of  the  kth  pass  with  the  subquery  optimization 

is 

it 
{J2  trijoinOR,  a,_!  *  djr1,  d[,  s,  *  4)}  +  group(5(Cfe),  Ck) 
1=1 

As  a  result  of  the  materialization  and  reuse  of  item  combinations,  Set-oriented  Apriori  re- 
quires only  a  single  3- way  join  in  the  kth  pass  * .  The  cost  of  the  kth  pass  in  Set-oriented 
Apriori  is 

trijoinCR/,  Tk.u  Ck,  5(Cfc))  +group(5(Cjt),  Ck) 

where  Tk-\  and  Ck  denote  the  cardinality  of  the  corresponding  tables.  The  grouping  cost 
is  the  same  as  that  of  the  subquery  approach.  The  table  Tk-\  contains  exactly  the  same 
tuples  as  that  of  subquery  Q/fc_1  and  hence  has  a  size  of  s;_i  *  d1^1 .  Also,  dkk  is  the  same 
as  Ck.    Therefore,  the  kth  pass  cost  of  Set-oriented  Apriori  is  the  same  as  the  kth  term  in 


Note  that  this  may  be  executed  as  two  2-way  joins  since  3-way  joins  are  not  generally  supported  in 
current  relational  systems. 
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the  join  cost  summation  of  the  subquery  approach.  This  results  in  significant  performance 
improvements  especially  in  the  higher  passes. 

Figure  3.14  compares  the  running  times  of  the  subquery  and  Set-oriented  Apriori  ap- 
proaches for  the  dataset  T10.I4.D100K  for  0.33%  support.  We  show  only  the  times  for 
passes  3  and  higher  since  both  the  approaches  are  the  same  in  the  first  two  passes. 
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Figure  3.14.  Benefit,  of  reusing  item  combinations 


Space  Overhead 

The  Set-oriented  Apriori  approach  requires  additional  space  in  order  to  store  the  item 
combinations  generated.  The  size  of  the  table  7jt  is  the  same  as  S(Ck),  which  is  the  total 
support  of  all  the  k-item  candidates.  Assuming  that  the  tid  and  item  attributes  are  integers, 
each  tuple  in  Tk  consists  of  A;  +  1  integer  attributes.  Figure  3.15  shows  the  space  required 
to  store  Tk  in  terms  of  number  of  integers,  for  the  dataset  T10.I4.D100K  for  two  different 
support  values.  The  space  needed  for  the  input  data  table  T  is  also  shown  for  comparison. 
T2  is  not  shown  in  the  graph  since  we  do  not  materialize  and  store  it  in  the  Set-oriented 
Apriori  approach.  Note  that  once  Tk  is  materialized  Tk_i  can  be  deleted  unless  it  needs  to 
be  retained  for  some  other  purposes. 
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Figure  3.15.  Space  requirements  of  the  set-oriented  apriori  approach 

In  the  earlier  passes,  if  the  number  of  item  combinations  is  too  large  and  we  do  not 
want  to  store  them,  the  Subquery  approach  can  be  used  instead.  However,  the  Subquery 
approach  will  also  generate  large  intermediate  tables  except  in  cases  where  the  output  of  an 
intermediate  join  is  pipelined  to  the  subsequent  operation.  The  transition  to  Set-oriented 
Apriori  can  be  made  in  any  pass  by  materializing  the  item  combinations.  An  advantage  of 
the  Set-oriented  Apriori  is  that  it  requires  only  simple  joins  and  hence  it  is  easier  for  the 
optimizer  to  handle  them.  With  multiple  levels  of  nested  subqueries,  optimization  becomes 
much  harder. 

3.9     Performance  Experiments  with  Set-Oriented  Apriori 

In  this  section,  we  compare  the  total  running  time  of  the  Subquery  and  Set-oriented 
Apriori  approaches. 

We  studied  the  performance  of  the  Subquery  approach  (the  best  SQL-92  approach  in 
Section  3.6)  and  the  Set-oriented  Apriori  for  a  wide  range  of  data  parameters  and  support 
values.  We  report  the  results  on  two  of  the  datasets— T5.I2.D100K  and  T10.I4.D100K— 
which  are  described  in  Section  3.7.3. 
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T10.I4.D100K:  Total  time 
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Figure  3.16.  Comparison  of  Subquery  and  Set-oriented  Apriori  approaches 

In  Figure  3.16,  we  show  the  relative  performance  of  Subquery  and  Set-oriented  Apriori 
approaches  for  the  two  datasets.  The  chart  shows  the  total  time  taken  for  each  of  the 
different  passes.  Set-oriented  Apriori  performs  better  than  Subquery  for  all  the  support 
values.  The  first  two  passes  of  both  the  approaches  are  similar  and  they  take  approximately 
equal  amount  of  time.  The  difference  between  Set-oriented  Apriori  and  Subquery  widens  for 
higher  numbered  passes  as  explained  in  Section  3.8.4.      For  T5.I2.D100K,  F2  was  empty 
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for  support  values  higher  than  0.3%  and  therefore  we  chose  lower  support  values  to  study 
the  relative  performance  in  higher  numbered  passes. 

In  some  cases,  the  optimizer  did  not  choose  the  best  plan.  For  example,  for  joins 
with  T  {Tj  for  Set-oriented  Apriori),  the  optimizer  chose  nested  loops  plan  using  (item,  tid) 
index  on  T  in  many  cases  where  the  corresponding  sort-merge  plan  was  faster;  an  order 
of  magnitude  faster  in  some  cases.  We  were  able  to  experiment  with  different  plans  by 
disabling  certain  join  methods  (disabling  nested  loops  join  for  the  above  case).  We  also 
broke  down  the  multi-way  joins  in  some  cases  into  simpler  two-way  joins  to  study  the 
performance  implications.  The  reported  times  correspond  to  the  best  join  order  and  join 
methods. 

Figure  3.17  shows  the  CPU  time  and  I/O  time  taken  for  the  dataset  T10.I4.D100K. 
The  two  approaches  show  the  same  relative  trend  as  the  total  time.  However,  it  should  be 
noted  that  the  I/O  time  is  less  than  one  third  of  the  CPU  time.  This  shows  that  there 
is  a  need  to  revisit  the  traditional  optimization  and  parallelization  strategies  designed  to 
optimize  for  I/O  time,  in  order  to  handle  the  newer  decision  support  and  mining  queries 
efficiently. 

3.9.1     Scale-up  Experiment 

We  experimented  with  several  synthetic  datasets  to  study  the  scale-up  behavior  of  Set- 
oriented  Apriori  with  respect  to  increasing  number  of  transactions  and  increasing  transaction 
size.  Figure  3.18  shows  how  Set-oriented  Apriori  scales  up  as  the  number  of  transactions  is 
increased  from  10,000  to  1  million.  We  used  the  datasets  T5.I2  and  T10.I4  for  the  average 
sizes  of  transactions  and  itemsets  respectively.  The  minimum  support  level  was  kept  at  1%. 
The  first  graph  shows  the  absolute  execution  times  and  the  second  one  shows  the  times 
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T10.I4.D100K:  CPU  time 
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Figure  3.17.  Comparison  of  CPU  and  I/O  times 

normalized  with  respect  to  the  times  for  the  10,000  transaction  datasets.  It  can  be  seen 
that  the  execution  times  scale  quite  linearly  and  both  the  datasets  exhibit  similar  scale-up 
behavior. 

The  scale-up  with  increasing  transaction  size  is  shown  in  Figure  3.19.  In  these  exper- 
iments we  kept  the  physical  size  of  the  database  roughly  constant  by  keeping  the  product 
of  the  average  transaction  size  and  the  number  of  transactions  constant.  The  number  of 
transactions  ranged  from  200,000  for  the  database  with  an  average  transaction  size  of  5  to 


57 


O      200     400     600     800     1000    1200 

Number  ol  Transactions  (in  thousands) 


-T10.I4  •■■«•-  T5.I2 


0       200      400      600      800     1000     1200 

Number  of  Transactions  (in  thousands) 

Figure  3.18.  Number  of  transactions  scale-up 

20,000  for  the  database  with  an  average  transaction  size  of  50.  We  fixed  the  minimum  sup- 
port level  in  terms  of  the  number  of  transactions,  since  fixing  it  as  a  percentage  would  have 
led  to  large  increases  in  the  number  of  frequent  itemsets  as  the  transaction  size  increased. 
The  numbers  in  the  legend  (for  example,  1000)  refer  to  this  minimum  support.  The  execu- 
tion times  increase  with  the  transaction  size,  but  only  gradually.  The  main  reason  for  this 
increase  was  that  the  number  of  item  combinations  present  in  a  transaction  increases  with 
the  transaction  size. 
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Figure  3.19.  Transaction  size  scale-up 


There  has  been  increasing  interest  in  developing  scalable  data  mining  techniques  [23, 
33,  117].  For  the  scalability  of  the  SQL  approaches,  we  leverage  the  development  effort 
spent  in  making  the  relational  query  processors  scalable. 

3.10     Summary 

In  this  chapter,  we  presented  four  SQL-92  formulations  for  association  rule  mining.  We 
analyzed  the  best  approach  and  developed  cost  formulae  for  the  different  execution  plans 
of  the  corresponding  SQL  queries  based  on  the  input  data  parameters  and  the  relational 
operator  costs.  This  cost  analysis  provides  a  basis  for  incorporating  the  semantics  of  mining 
algorithms  into  future  query  optimizers. 

While  doing  the  experiments,  it  was  difficult  to  force  the  optimizer  to  choose  certain 
execution  plans  and  join  methods  since  commercial  DBMSs  do  not  provide  that  level  of 
control.  Postgres  was  relatively  better  in  that  respect,  since  we  could  control  the  choice 
of  join  methods.  However,  the  Postgres  optimizer  did  not  optimize  long  queries,  especially 
the  ones  involving  nested  subqueries,  well. 
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Note  A  part  of  the  work  described  in  this  chapter  was  primarily  done  by  researchers 
from  IBM  Almaden  Research  Center.  Specifically,  the  SQL-based  candidate  generation  in 
Section  3.4.1  and  the  support  counting  approaches  in  Section  3.5  were  developed  by  them. 
They  are  included  in  this  dissertation  for  completeness. 


CHAPTER  4 
SUPPORT  COUNTING  USING  SQL  WITH  OBJECT-RELATIONAL  EXTENSIONS 


In  this  chapter,  we  study  alternative  approaches  that  make  use  of  additional  object- 
relational  features  in  SQL.  For  each  approach,  we  also  outline  a  cost-based  analysis  of  the 
execution  time  to  enable  one  to  choose  between  these  different  approaches.  We  present  six 
different  approaches,  optimizations  and  their  cost  estimates  in  Sections  4.1,  4.2  and  4.3. 
Experimental  results  comparing  the  performance  of  these  approaches  are  presented  in  Sec- 
tion 4.4.  In  Section  4.5,  we  propose  a  hybrid  approach  which  combines  the  best  of  all 
approaches.  The  performance  of  different  architectural  alternatives  described  in  Chapter  2 
is  compared  in  Section  4.6.  In  Section  4.7,  we  summarize  qualitative  comparisons  of  these 
architectures.  The  applicability  of  the  SQL-based  approach  to  other  association  rule  mining 
algorithms  are  briefly  discussed  in  Section  4.8. 

4.1     Gather  Join 

This  approach  (see  Figure  4.1)  is  based  on  the  use  of  table  functions  described  in  sec- 
tion 3.4.2.  It  generates  all  possible  &-item  combinations  of  items  contained  in  a  transaction, 
joins  them  with  the  candidate  table  Ck  and  counts  the  support  of  the  itemsets  by  grouping 
the  join  result.  It  uses  two  table  functions  Gather  and  Comb-K.  The  data  table  T  is  scanned 
in  the  (tid,  item)  order  and  passed  to  the  table  function  Gather.  This  table  function  col- 
lects all  the  items  of  a  transaction  (in  other  words,  items  of  all  tuples  of  T  with  the  same 
tid)  in  memory  and  outputs  a  record  for  each  transaction.  Each  such  record  consists  of  two 
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attributes,  the  tid  and  item-list  which  is  a  collection  of  all  its  items  in  a  VARCHAR  or  a 
BLOB.  The  output  of  Gather  is  passed  to  another  table  function  Comb-K  which  returns  all 
/c-item  combinations  formed  out  of  the  items  of  a  transaction.  A  record  output  by  Comb-K 
has  k  attributes  TJ,tm\, . . .  ,TJtmk,  which  can  be  directly  used  to  probe  into  the  CV-  table. 
An  index  is  constructed  on  all  the  items  of  Ck  to  make  the  probe  efficient.  Figure  4.1 
presents  SQL  queries  for  this  approach. 

This  approach  is  analogous  to  the  KwayJoin  approach  where  we  have  replaced  the 
A;-way  self  join  of  T  with  table  functions  Gather  and  Comb-K.  These  table  functions  are 
easy  to  code  and  do  not  require  a  large  amount  of  memory.  Also,  it  is  possible  to  merge 
them  together  as  a  single  table  function  GatherComb-K,  which  is  what  we  did  in  our  imple- 
mentation. The  Gather  function  is  not  required  when  the  data  is  already  in  a  horizontal 
format  where  each  tid  is  followed  by  a  collection  of  all  its  items. 

4.1.1  Special  Pass  2  Optimization 

Note  that  for  k  =  2,  the  2-candidate  set  Ci  is  simply  a  join  of  F\  with  itself.  Therefore, 
we  can  specially  optimize  the  pass  2  by  replacing  the  join  with  C%  by  a  join  with  F\  before 
the  table  function  (see  Figure  4.2).  That  way,  the  table  function  gets  only  frequent  items 
and  generates  significantly  fewer  2-item  combinations.  This  optimization  can  be  useful  for 
other  passes  too  but  unlike  for  pass  2  we  still  have  to  do  the  join  with  Cfc. 

4.1.2  Variations  of  GatherJoin  Approach 

GatherCount 

One  variation  of  the  GatherJoin  approach  for  pass  two  is  the  GatherCount  approach 
where  we  push  the  group-by  inside  the  table  function  instead  of  doing  it  outside.  The 
candidate  2-itemsets  (C2)  are  represented  as  a  two  dimensional  array  inside  the  modified 
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insert  into  Fk  select  itemi, . . .  ,itemk,  count(*) 

from  Cfc,  (select  t2.TJtmi,. . . ,  t2.T  Atmk  from  T, 

table  (Gather(T.tid,  T.item))  as  tu 

table  (Comb-K(/i.tid,  ^.item-list))  as  t2) 
where         t2-T  Atrri\  =  Ck-item\  and 

t2-T  dtmk  =  Ck.itemk 
group  by    Cjt.tiemi, . . . ,  Ck.itemk 
having        count(*)  >  rminsup 

having 
count(*)  >  iminsup 

t 
Group  by 

iteml,....,itemk 

t 
t2.T_itml  =  Ck.iteml   r-\V^- -1 

t2.T_itmk  =i  Ck.iteiTikJs^^J 

^^"t2  ^"s-->. 

Table  function  Ck 

Comb-K 

I 
Table  function 
Gather 

Order  by 
tid,  item 

T 

Figure  4.1.  Support  counting  by  Gather  Join 

table  function  Gather-Cnt  for  doing  the  support  counting.  Instead  of  outputting  the  2-item 
combinations  of  a  transaction,  it  directly  uses  it  to  update  support  counts  in  the  memory 
and  output  only  the  frequent  2-itemsets,  F2  and  their  support  after  the  last  transaction. 
Thus,  the  table  function  Gather-Cnt  is  an  extension  of  the  GatherComb-2  table  function 
used  in  Gather Join. 

The  absence  of  the  outer  grouping  makes  this  option  rather  attractive.  The  UDF  code 
is  also  small  since  it  only  needs  to  maintain  a  2D  array.  We  could  apply  the  same  trick  for 
subsequent  passes  but  the  coding  becomes  considerably  more  complicated  because  of  the 
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insert  into  F2  select  tti-T-itmi,  U2.TJ.tm2,  count(*) 

from  (select  *  from  T,  F\  where  T.item  =  F\.item{)  as  tt\, 

table  (GatherComb-2(tid,item))  as  tt2) 
group  by    tt2.TJtm\,  U2.TJ.tm2 
having        count(*)  >  :minsup 


having 
count(*)  >•  :minsup 

t 
Group  by 

tt2.X_itml,  tt2.T_itm2 

f     tl2 

Table  function 

GatherComb-K 

t 
.item  =  Fl.ileml  (j^XCJ 


T  Fl 

Figure  4.2.  Support  Counting  by  Gather  Join  in  the  second  pass 

need  to  maintain  hash-tables  to  index  the  C^'s.  The  disadvantage  of  this  approach  is  that 
it  can  require  a  large  amount  of  memory  to  store  C2.  If  enough  memory  is  not  available, 
C2  needs  to  be  partitioned  and  the  process  has  to  be  repeated  for  each  partition.  Another 
serious  problem  with  this  approach  is  that  it  cannot  be  automatically  parallelized  unlike 
the  other  approaches. 

GatherPrune 

A  potential  problem  with  the  Gather  Join  approach  is  the  high  cost  of  joining  the 
large  number  of  item  combinations  with  C*.  We  can  push  the  join  with  Ck  inside  the  table 
function  and  thus  reduce  the  number  of  such  combinations.  C*  is  converted  to  a  BLOB 
and  passed  as  an  argument  to  the  table  function. 

The  cost  of  passing  the  BLOB  for  every  tuple  of  T  could  be  high.  In  general,  we  can 
reduce  the  parameter  passing  cost  by  using  a  smaller  Blob  that  only  approximates  the  real 
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Cfc.  The  trade-off  is  increased  cost  for  other  parts  notably  grouping  because  not  as  many 
combinations  are  filtered. 

Horizontal 

Another  variation  of  GatherJoin  is  the  Horizontal  approach  that  first  uses  the 
Gather  function  to  transform  the  data  to  the  horizontal  format  but  is  otherwise  similar 
to  the  GatherJoin  approach. 

Rajamani  et  al.  [105]  propose  finding  associations  using  an  approach  quite  similar  to 
this  Horizontal  approach.  They  assume  (rather  unrealistically)  that  the  data  is  already  in 
a  horizontal  format.  However,  they  do  not  use  the  frequent  item-set  filtering  optimization 
we  outlined  for  pass  2.  Without  this  optimization,  the  time  for  pass  2  for  most  real-life 
datasets  blows  up  even  for  relatively  high  support  values.  Also,  at  the  time  of  candidate 
generation,  rather  than  doing  self-join  on  F*_i,  they  join  Fk-\  with  F\,  thereby  generating 
considerably  more  combinations  than  needed.  Thus,  the  approach  in  Rajamani  et  al.  [105] 
is  likely  to  perform  worse  than  Horizontal. 

4.1.3     Cost  Analysis  of  GatherJoin  and  its  Variants 

The  choice  of  the  best  approach  depends  on  a  number  of  data  characteristics  like  the 
number  of  items,  total  number  of  transactions,  average  length  of  a  transaction  and  so  on. 
We  express  the  costs  of  different  approaches  in  each  pass  in  terms  of  parameters  that  are 
known  or  can  be  estimated  after  the  candidate  generation  step  of  each  pass.  We  include 
only  the  terms  that  were  found  to  be  the  dominant  part  of  the  total  cost  in  practice.  We 
use  the  notations  of  Table  3.2  in  the  cost  analysis. 

The  cost  of  GatherJoin  includes  the  cost  of  generating  A;-item  combinations,  joining 
with  Cfc  and  grouping  to  count  the  support.  The  number  of  &-item  combinations  generated, 
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Tk  is  C(N,  k)  *  T.  Join  with  Ck  filters  out  the  non-candidate  item  combinations.  The  size 
of  the  join  result  is  the  sum  of  the  support  of  all  the  candidates  denoted  by  S(Ck).  The 
actual  value  of  the  support  of  a  candidate  itemset  will  be  known  only  after  the  support 
counting  phase.  However,  we  get  a  good  estimate  by  approximating  it  to  the  minimum  of 
the  support  of  all  its  (k  —  l)-subsets  in  Fk-i-  The  total  cost  of  the  Gather  Join  approach  is 

Tk  *  tk  +  join(Tfc,  Ck,S{Ck))  +  gvonp(S(Ck),Ck),  where  Tk  =  C(N,k)  *  T 

The  above  cost  formula  needs  to  be  modified  to  reflect  the  special  optimization  of 
joining  with  F\  to  consider  only  frequent  items.  We  need  a  new  term  join(i?, F\,Rf)  and 
need  to  change  the  formula  for  Tk  to  include  only  frequent  items  Nj  instead  of  N. 

For  the  second  pass,  we  do  not  need  the  outer  join  with  Ck.  The  total  cost  of 
GatherJoin  in  the  second  pass  is 


N2f*T 
join(J2,  FuRf)  +  T2*t2  +  group(T2,  C2),  where  T2  =  C(Nf,  2)*T^  -J—- 


Cost  of  GatherCount  in  the  second  pass  is  similar  to  that  for  basic  GatherJoin  except 
for  the  final  grouping  cost.  In  this  formula,  "group_int"  denotes  the  cost  of  doing  the 
support  counting  inside  the  table  function. 

join(fl,  Fi,Rf)  +  group  _int(T2,  C2)  +F2*t2 

For  GatherPrune  the  cost  equation  is 

R  *  blob(A:  *  Ck)  +  S(Ck)  *  tk  +  group(S(Ck),  Ck). 


(.() 


We  use  blob(A:  *  Ck)  for  the  BLOB  passing  cost  since  each  itemset  in  Ck  contains  k  items. 

The  cost  estimate  of  Horizontal  is  similar  to  that  of  GatherJoin  except  that  here 

the  data  is  materialized  in  the  horizontal  format  before  generating  the  item  combinations. 

4.2     Vertical 

In  this  approach,  we  first  transform  the  data  table  into  a  vertical  format  by  creating  for 
each  item  a  BLOB  containing  all  tids  that  contain  that  item  (Tid-list  creation  phase)  and 
then  count  the  support  of  itemsets  by  merging  together  these  tid-lists  (support  counting 
phase).  This  approach  is  related  to  the  approaches  discussed  in  Savasere  et  al.  [Ill]  and 
Zaki  et  al.  [131]. 

For  creating  the  Tid-lists  we  use  a  table  function  Gather.  This  is  the  same  as  the 
Gather  function  in  GatherJoin  except  that  here  we  create  the  tid-list  for  each  frequent 
item.  The  data  table  T  is  scanned  in  the  (item,  tid)  order  and  passed  to  the  function 
Gather.  The  function  collects  the  tids  of  all  tuples  of  T  with  the  same  item  in  memory  and 
outputs  a  (item,  tid-list)  tuple  for  items  that  meet  the  minimum  support  criterion.  The  tid- 
lists  are  represented  as  BLOBs  and  stored  in  a  new  TidTable  with  attributes  (item,  tid-list). 
The  SQL  query  which  does  the  transformation  to  vertical  format  is  given  in  Figure  4.3. 

insert  into  TidTable  select  item,  tid-list  from  em'  !' 

\   tt2 

Table  function 

(select  *  from  T  order  by  item,  tid)  as  tt\,  Gather 

t 
table(Gather(item,tid,:minsup))  as  ii2  itom.tW 

f 
T 

Figure  4.3.  Tid-list  creation 

In  the  support  counting  phase,  conceptually  for  each  itemset  in  Ck  we  want  to  collect 
the  tid-lists  of  all  k  items  and  use  a  UDF  to  count  the  number  of  tids  in  the  intersection 
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of  these  k  lists.  The  tids  are  in  the  same  sorted  order  in  all  the  tid-lists  and  therefore  the 
intersection  can  be  done  easily  and  efficiently  by  a  single  pass  of  the  k  lists.  This  conceptual 
step  can  be  improved  further  by  decomposing  the  intersect  operation  so  that  we  can  share 
these  operations  across  itemsets  having  common  prefixes  as  follows: 

We  first  select  distinct  (itemi, item.2)  pairs  from  Cfc.  For  each  distinct  pair  we  first  per- 
form the  intersect  operation  to  get  a  new  result-tidlist,  then  find  distinct  triples  {item\ ,  item.2, 
itemz)  from  Ck  with  the  same  first  two  items,  intersect  result-tidlist  with  tid-list  for  item?, 
for  each  triple  and  continue  with  item\  and  so  on  until  all  k  tid-lists  per  itemset  are  inter- 
sected. 

The  above  sequence  of  operations  can  be  written  as  a  single  SQL  query  for  any  k  as 
shown  in  Figure  4.4.  The  final  intersect  operation  can  be  merged  with  the  count  operation 
to  return  a  count  instead  of  the  tid-list.  We  do  not  include  this  optimization  in  the  query 
of  Figure  4.4  for  simplicity. 

4.2.1     Special  Pass  2  Optimization 

For  pass  2  we  need  not  generate  Ci  and  join  the  TidTables  with  C2.  Instead,  we 
perform  a  self-join  on  the  TidTable  using  predicate  t\.item  <  t2.it.em. 

insert  into  Fj  select  t\.item,t2.item,  cnt 

from  (select  iterri\, item.2,  CountIntersect(<i. tid-list,  ^.tid-list)  as  cnt 

from     TidTable  tu  TidTable  t2 

where  t\.item  <  t2.item)  as  t 
where         cnt  >  :minsup 


(iS 


insert  into  F*.  select  itemi, . . .  ,itemk,  count(tid-list)  as  cnt 
from  (Subquery  Qk)  t   where  cnt  >  :minsup 

Subquery  Qi  (for  any  I  between  2  and  k) 

select  itemi,. .  .itemi,  Intersect(r/_1.tid-list,//.tid-list)  as  tid-list 
from    TidTable  ti,  (Subquery  Qi-\)  as  rj_!, 

(select  distinct  item\  . . .  itemi  from  Ck)  as  d/ 
where  ri_\.item\  —  di.item\  and  . . .  and 

ri-\.itemi-\  =  d/.z£emj_iand 

ti.item  =  di.itemi 

Subquery  Q\ :   (select  *  from  TidTable) 

itemi,. ...itemi,  tid-list 

|  tid-list 
Intersect 
(UDF) 
r_l- 1. tid-list,  tl.tid-list     t 

ti.item  =  di.itemi    P^XCj 
r_l-l.  itemi        =. di.itemi 

r_l-l.item_l-l  =dl.item_l-l      P><CJ  TidTable  tl 

r  I-l^-""""         ^\dl 


Subquery  Q_l-1 


select  distinct 
iteml,...,iteml 

t 


Ck 


Tree  diagram  for  Subquery  Qi 


Figure  4.4.  Support  counting  using  UDF 
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4.2.2     Cost  Analysis 

The  cost  of  the  Vertical  approach  during  support  counting  phase  is  dominated  by 
the  cost  of  invoking  the  UDFs  and  intersecting  the  tid-lists.  The  UDF  is  first  called  for 
each  distinct  item  pair  in  C*,  then  for  each  distinct  item  triple  with  the  same  first  two 
items  and  so  on.  Let  d1-  be  the  number  of  distinct  j  item  tuples  in  Ck.  Then  the  total 
number  of  times  the  UDF  is  invoked  is  Y!j=2  d*.  In  each  invocation  two  BLOBs  of  tid-list 
are  passed  as  arguments.  The  UDF  intersects  the  tid-lists  by  a  merge  pass  and  hence  the 
cost  is  proportional  to  2  *  average  length  of  a  tid-list.  The  average  length  of  a  tid-list  can  be 
approximated  to  j£.  Note  that  with  each  intersect  the  tid-list  keeps  shrinking.  However, 
we  ignore  such  effects  for  simplicity. 

In  addition  to  the  intersect  cost  it  includes  the  cost  of  joins  in  the  query  also.  The 
join  cost  of  subquery  Qi  can  be  recursively  defined  as 

C(Qi)  =  tiijom(F1,Qt-.1,Ck,4)  +  C(Ql-1) 

where  tri-join(p,  q,  r,  s)  denotes  the  cost  of  joining  three  relations  of  size  p,  q,  r  respectively 
producing  a  result  of  size  a.  The  exact  cost  of  the  3- way  join  will  depend  on  the  join  order. 
The  cost  of  subquery  Qx  is  the  cost  of  scanning  the  TidTable  which  has  Fi  tuples.  The 
result  size  of  the  subquery  Qt  is  dk  and  the  result  size  of  Qx  is  F\.  The  total  cost  of  the 
Vertical  approach  is 

k 

C(Qk)  +  (£  4)  *  I2  *  Blob  A  +  Intersect^)} 

;=2  *i  *i 
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In  the  formula  above  Intersect (n)  denotes  the  cost  of  intersecting  two  tid-lists  with 
a  combined  size  of  n.  The  total  cost  is  dominated  by  the  intersect  cost  and  join  costs 
account  for  only  a  small  fraction.  Therefore,  we  can  safely  ignore  the  join  costs  in  the 
above  formulae. 

The  total  cost  of  the  second  pass  is 


join(F1,F1,  ^-)  +  C2  *  {2  *  Blob  A  +  Intersect^-)} 


4.3     SQL-Bodied  Functions 

This  approach  is  based  on  SQL-bodied  functions  commonly  known  as  SQL/PSM  [87]. 
SQL/PSMs  extend  SQL  with  additional  control  structures.  We  make  use  of  one  such 
construct  for  do  ..  end. 

We  use  the  for  construct  to  scan  the  transaction  table  T  in  the  (tid,  item)  order.  Then, 
for  each  tuple  (tid,  item)  of  T,  we  update  those  tuples  of  Ck  that  contain  one  matching 
item.  Ck  is  extended  with  3  extra  attributes  (prevTid,  match,  supp).  The  prevTid  attribute 
keeps  the  tid  of  the  previous  tuple  of  T  that  matched  that  itemset.  The  match  attribute 
contains  the  number  of  items  of  prevTid  matched  so  far  and  supp  holds  the  current  support 
of  that  itemset.  On  each  column  of  Ck  an  index  is  built  to  do  a  searched  update. 

for  this  as  select  *  from  T  do 
update  Ck  set  prevTid  =  tid, 

match  =  case  when  tid  =  prevTid  then  match+1  else  1  end, 
supp  =  case  when  match  =  k-1  and  tid  =  prevTid  then  supp+1 
else  supp 
end 
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where  item  =  item\  or 

item  =  itemk 
end  for 

insert  into  Fk  select  itemi, . . .  ,itemk,  supp 
from  Ck  where  supp  >  :minsupp 

The  cost  of  this  approach  can  be  mainly  attributed  to  the  cost  of  updates  to  the 
candidate  table  Ck-  For  each  tuple  of  the  data  table  T,  for  all  the  candidate  itemsets  in  Ck 
which  contains  that  item,  three  updates  are  performed  (the  attributes  prevTid,  match,  supp 
of  the  itemset  are  updated).  If  Nk  is  the  average  number  of  A;-item  candidates  containing 
any  given  item,  the  total  number  of  updates  is  3  *  R*  JV*.  The  cost  due  to  updates  for  this 
approach  is  U(3  *  R  *  Nk)  where  U(n)  is  the  cost  of  n  updates.  If  the  updates  are  logged 
this  cost  includes  the  logging  cost  also. 

4.4     Performance  Comparison 

We  studied  the  performance  of  six  approaches  in  this  category:  Gather  Join  and  its 
variants  GatherPrune,  Horizontal  and  GatherCountVertical  and  SBF.  We  used  the  four 
datasets  summarized  in  Table  3.1.  In  Figure  4.5  we  show  the  performance  of  only  the  four 
approaches:  Gather  Join,  GatherCount,  GatherPrune  and  Vertical.  For  the  other 
two  approaches  the  running  times  were  comparatively  so  large  that  we  had  to  abort  the 
runs  in  many  cases.  The  main  reason  why  the  Horizontal  approach  was  significantly 
worse  than  the  Gather  Join  approach  was  the  time  to  transform  the  data  to  the  horizontal 
format.  For  instance,  for  Dataset-C  it  was  3.5  hours  which  is  almost  20  times  more  than 
the  time  taken  by  Vertical  for  2%  support.    For  Dataset-B  the  process  was  aborted 
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after  running  for  5  hours.  After  the  transformation,  compared  to  GatherJoin  the  time 
taken  by  Horizontal  was  also  significantly  worse  when  run  without  the  frequent  itemset 
filtering  optimization  but  with  the  optimization  the  performance  was  comparable.  The  SBF 
approach  had  significantly  worse  performance  because  of  the  expensive  indexing  ORing  of 
the  k  join  predicates.  Another  problem  with  this  approach  is  the  large  number  of  updates 
to  the  Ck  table.  In  DB2,  all  of  these  updates  are  logged  resulting  in  severe  performance 
degradation. 


Data  Mt-  A 

1  Prep  ■  Pass  1  o  Pass  2  □  Pass  3, 


Data  set-  C 


□  Prep  ■  Pass  1  □  Pass  2  m  Pass  3  ■  Pass  4  | 
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Data  set-   D 
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Figure  4.5.  Comparison  of  four  SQL-OR  approaches:  Vertical,  GatherPrune.  GatherJoin 
and  GatherCount 
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Figure  4.5  shows  the  total  running  time  of  the  different  approaches.  The  time  taken 
is  broken  down  by  each  pass  and  an  initial  "prep"  stage  where  any  one-time  data  transfor- 
mation cost  is  included.  We  can  make  several  observations  from  the  experimental  results. 
First,  let  us  concentrate  on  the  overall  comparison  between  the  different  approaches.  Then 
we  will  compare  the  approaches  based  on  how  they  perform  in  each  pass  of  the  algorithm. 

Overall,  the  Vertical  approach  has  the  best  performance  and  is  sometimes  more  than 
an  order  of  magnitude  better  than  the  other  three  approaches. 

The  majority  of  the  time  of  the  Vertical  approach  is  spent  in  transforming  the 
data  to  the  Vertical  format  in  most  cases  (shown  as  "prep"  in  figure  4.5).  The  vertical 
representation  is  like  an  index  on  the  item  attribute.  If  we  think  of  this  time  as  a  one- 
time activity  like  index  building  then  performance  looks  even  better.  Note  that  the  time 
to  transform  the  data  to  the  Vertical  format  was  much  smaller  than  the  time  for  the 
horizontal  format  although  both  formats  write  almost  the  same  amount  of  data.  The  main 
reason  was  the  difference  in  the  number  of  records  written.  The  number  of  frequent  items 
is  often  two  to  three  orders  of  magnitude  smaller  than  the  number  of  transactions. 

Between  Gather  Join  and  GatherPrune,  neither  strictly  dominates  the  other.  The 
special  optimization  in  Gather  Join  of  pruning  based  on  F\  had  a  big  impact  on  perfor- 
mance. With  this  optimization,  for  Dataset-B  with  support  0.1%,  the  running  time  for 
pass  2  alone  was  reduced  from  5.2  hours  to  10  minutes. 

When  we  compare  these  different  approaches  based  on  time  spent  in  each  pass  we 
observe  that  no  single  approach  is  "the  best"  for  all  different  passes  of  the  different  datasets 
especially  for  the  second  pass. 

For  pass  three  onwards,  Vertical  is  often  two  or  more  orders  of  magnitude  better  than 
the  other  approaches.  Even  in  cases  like  Dataset-B,  support  0.01%  where  it  spends  three 
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hours  in  the  second  pass,  the  total  time  for  next  two  passes  is  only  14  seconds  whereas  it  is 
more  than  an  hour  for  the  other  two  approaches.  For  subsequent  passes  the  performance 
degrades  dramatically  for  Gather  Join,  because  the  table  function  Gather-Comb-K  generates 
a  large  number  of  combinations.  For  instance,  for  pass  3  of  Dataset-C  even  for  support 
value  of  2%  pass  3  did  not  complete  after  5.2  hours  whereas  for  Vertical  pass  3  finished 
in  0.2  seconds.  GatherPrune  is  better  than  Gather  Join  for  the  third  and  later  passes.  For 
pass  2  GatherPrune  is  worse  because  the  overhead  of  passing  a  large  object  as  an  argument 
dominates  cost. 

The  Vertical  approach  sometimes  ended  up  spending  too  much  time  in  the  second 
pass.  In  some  of  these  cases  the  GatherJoin  approach  was  better  in  the  second  pass 
(for  instance  for  low  support  values  of  Dataset-B)  whereas  in  other  cases  (for  instance, 
Dataset-C  support  0.25%)  GatherCount  was  the  only  good  option.  For  this  case  both  the 
GatherPrune  and  GatherJoin  did  not  complete  after  more  than  six  hours  even  for  pass  2. 
Further,  they  caused  a  storage  overflow  error  because  of  the  large  size  of  the  intermediate 
results  to  be  sorted.  We  had  to  divide  the  dataset  into  four  equal  parts  and  run  the  second 
pass  independently  on  each  partition  to  avoid  this  problem. 

-    Vertical  Gioln 


^=      300 
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Averaya  transaction   length 


Figure  4.6.  Effect  of  increasing  transaction  length  (average  number  of  items  per  transaction) 
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Two  factors  that  affect  the  choice  amongst  the  Vertical,  Gather  Join  and  GatherCount 
approaches  in  different  passes  and  pass  2  in  particular  are:  number  of  frequent  items  (Fi) 
and  the  average  number  of  frequent  items  per  transaction  (Nf).  From  the  graphs  in  Fig- 
ure 4.5  we  notice  that  as  the  value  of  the  support  is  decreased  for  each  dataset  causing 
the  size  of  F\  to  increase,  the  performance  of  pass  2  of  the  Vertical  approach  degrades 
rapidly.  This  trend  is  also  clear  from  our  cost  formulae.  The  cost  of  the  Vertical  approach 
increases  quadratically  with  F\.  Gather  Join  depends  more  critically  on  the  number  of  fre- 
quent items  per  transaction.  For  Dataset-B  even  when  the  size  of  F\  increases  by  a  factor 
of  10,  the  value  of  Nf  remains  close  to  2,  therefore  the  time  taken  by  Gather  Join  does  not 
increase  as  much.  However,  for  Dataset-C  the  size  of  Nf  increases  from  3.2  to  10  as  the 
support  is  decreased  from  2.0%  to  0.25%  causing  Gather  Join  to  deteriorate  rapidly.  From 
the  cost  formula  for  Gather  Join  we  notice  that  the  total  time  for  pass  2  increases  almost 
quadratically  with  Nf. 

We  validate  this  observation  further  by  running  experiments  on  synthetic  datasets 
for  varying  values  of  the  number  of  frequent  items  per  transaction.  We  used  the  synthetic 
dataset  generator  described  in  Agrawal  et  al.  [9]  for  this  purpose.  We  varied  the  transaction 
length,  the  number  of  transactions  and  the  support  values  while  keeping  the  total  number 
of  records  and  the  number  of  frequent  items  fixed.  In  Figure  4.6  we  show  the  total  time 
spent  in  pass  2  of  the  Vertical  and  Gather  Join  approaches.  As  the  number  of  items  per 
transaction  (transaction  length)  increases,  the  cost  of  Vertical  remains  almost  unchanged 
whereas  the  cost  of  Gather  Join  increases. 
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4.5     Final  Hybrid  Approach 

The  previous  performance  section  helps  us  draw  the  following  conclusion:  Overall,  the 
Vertical  approach  is  the  best  option  especially  for  higher  passes.  When  the  size  of  the 
candidate  itemsets  is  too  large,  the  performance  of  the  Vertical  approach  could  suffer. 
In  such  cases,  Gather  Join  is  a  good  option  as  long  as  the  number  of  frequent  items  per 
transaction  (Nj)  is  not  too  large.  When  that  happens  GatherCount  may  be  the  only 
good  option  even  though  it  may  not  easily  parallelizable.  These  qualitative  differences  are 
captured  by  the  cost  formulae  we  presented  earlier  and  are  used  by  our  final  hybrid  scheme. 

The  hybrid  scheme  chooses  the  best  of  the  three  approaches  Gather  Join,  GatherCount 
and  Vertical  for  each  pass  based  on  the  cost  estimation  outlined  in  the  previous  sections. 
The  parameter  values  used  for  the  estimation  are  all  available  at  the  end  of  the  previous 
pass.  In  Section  4.6  we  plot  the  final  running  time  for  the  different  datasets  based  on  this 
hybrid  approach. 

4.6     Architecture  Comparisons 

In  this  section  our  goal  is  to  compare  five  architectural  alternatives:  Loose-coupling, 
Stored-procedure,  Cache-Mine,  UDF,  and  the  best  SQL  implementation. 

For  Loose-coupling,  we  use  the  implementation  of  the  Apriori  algorithm  [9]  for  find- 
ing association  rules  provided  with  the  IBM  data  mining  product,  Intelligent  Miner  [71]. 
For  Stored-procedure,  we  extracted  the  Apriori  implementation  in  Intelligent  Miner  and 
created  a  stored  procedure  out  of  it.  The  stored  procedure  is  run  in  the  unfenced  mode 
in  the  database  address  space.  For  Cache-Mine,  we  used  an  option  provided  in  Intelligent 
Miner  that  causes  the  input  data  to  be  cached  as  a  binary  file  after  the  first  scan  of  the  data 
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from  the  DBMS.  The  data  is  copied  in  the  horizontal  format  where  each  tid  is  followed  by 
an  encoding  of  all  its  frequent  items. 

For  the  UDF-architecture,  we  use  the  UDF  implementation  of  the  Apriori  algorithm 
described  in  Agrawal  and  Shim  [12].  In  this  implementation,  first  a  UDF  is  used  to  initialize 
state  and  allocate  memory  for  candidate  itemsets.  Next,  for  each  pass  a  collection  of  UDFs 
are  used  for  generating  candidates,  counting  support,  and  checking  for  termination.  These 
UDFs  access  the  initially  allocated  memory,  address  of  which  is  passed  around  in  BLOBs. 
Candidate  generation  creates  the  in-memory  hash-trees  of  candidates.  This  happens  en- 
tirely in  the  UDF  without  any  involvement  of  the  DBMS.  During  support  counting,  the 
data  table  is  scanned  sequentially  and  for  each  tuple  a  UDF  is  used  for  updating  the  counts 
on  the  memory  resident  hash-tree. 

4.6.1     Timing  Comparison 

In  Figure  4.7,  we  show  the  performance  of  the  four  architectural  alternatives:  Cache-Mine, 
Stored-procedure,  UDF  and  our  best  SQL  implementation  for  the  datasets  in  Table  3.1. 
We  do  not  show  the  times  for  the  Loose-coupling  option  because  its  performance  was  very 
close  to  the  Stored-procedure  option.  For  each  dataset  three  different  support  values  are 
used.  The  total  time  is  broken  down  by  the  time  spent  in  each  pass. 

We  can  make  the  following  observations. 

•  Cache-Mine  has  the  best  or  close  to  the  best  performance  in  all  cases.  80-90%  of  its 
total  time  is  spent  in  the  first  pass  where  data  is  accessed  from  the  DBMS  and  cached 
in  the  file  system.  Compared  to  the  SQL  approach  this  approach  is  a  factor  of  0.8  to 
2  times  faster. 
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Figure  4.7.  Comparison  of  four  architectures 

The  Stored-procedure  approach  is  the  worst.  The  difference  between  Cache-Mine 
and  Stored-procedure  is  directly  related  to  the  number  of  passes.  For  instance, 
for  Dataset-A  the  number  of  passes  increases  from  two  to  three  when  decreasing 
support  from  0.5%  to  0.35%  causing  the  time  taken  to  increase  from  two  to  three 
times.  The  time  spent  in  each  pass  for  Stored-procedure  is  the  same  except  when 
the  algorithm  makes  multiple  passes  over  the  data  since  all  candidates  could  not 
fit  in  memory  together.   This  happens  for  the  lowest  support  values  of  Dataset-B, 
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Dataset-C  and  Dataset-D.  Time  taken  by  Stored-procedure  can  be  expressed 
approximately  as  number  of  passes  times  time  taken  by  Cache-Mine. 

•  UDF  is  similar  to  Stored-procedure.  The  only  difference  is  that  the  time  per  pass 
decreases  by  30-50%  for  UDF  because  of  closer  coupling  with  the  database. 

•  The  SQL  approach  comes  second  in  performance  after  the  Cache-Mine  approach  for 
low  support  values  and  is  even  somewhat  better  for  high  support  values.  The  cost 
of  converting  the  data  to  the  vertical  format  for  SQL  is  typically  lower  than  the  cost 
of  transforming  data  to  binary  format  outside  the  DBMS  for  Cache-Mine.  However, 
after  the  initial  transformation  subsequent  passes  take  negligible  time  for  Cache-Mine. 
For  the  second  pass  SQL  takes  significantly  more  time  than  Cache-Mine  particularly 
when  we  decrease  support.  For  subsequent  passes  even  the  SQL  approach  does  not 
spend  too  much  time.  Therefore,  the  difference  between  Cache-Mine  and  SQL  is  not 
very  sensitive  to  the  number  of  passes  because  both  approaches  spend  negligible  time 
in  higher  passes. 

The  SQL  approach  is  1.8  to  3  times  better  than  Stored-procedure  or  Loose-coupling 
approach.  As  we  decreased  the  support  value  so  that  the  number  of  passes  over  the 
dataset  increases,  the  gap  widens.  Note  that  we  could  have  implemented  a  Stored- 
procedure  using  the  same  hybrid  algorithm  that  we  used  for  SQL  instead  of  using  the 
IM  algorithm.  Then,  we  expect  the  performance  of  Stored-procedure  to  improve 
because  the  number  of  passes  to  the  data  will  decrease.  However,  we  will  pay  the 
storage  penalty  of  making  additional  copy  of  the  data  as  we  did  in  the  Cache-Mine 
approach.  The  performance  of  Stored-procedure  cannot  be  better  than  Cache-Mine 
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because  as  we  have  observed  the  most  of  the  time  of  Cache-Mine  is  spent  in  the  first 
pass  which  cannot  be  changed  for  Stored-procedure. 

4,6.2     Scale-up  experiment 

Our  experiments  with  the  four  real-life  datasets  above  has  shown  the  scaling  property 
of  the  different  approaches  with  decreasing  support  value  and  increasing  number  of  frequent 
itemsets.  We  experiment  with  synthetic  datasets  to  study  other  forms  of  scaling:  increasing 
number  of  transactions  and  increasing  average  length  of  transactions.  In  Figure  4.8  we  show 
how  Stored-procedure,  Cache-Mine  and  SQL  scale  with  increasing  number  of  transactions. 
UDF  and  Loose-coupling  have  similar  scale-up  behavior  as  Stored-procedure,  therefore 
we  do  not  show  these  approaches  in  the  figure.  We  used  a  dataset  with  10  average  number  of 
items  per  transaction,  100  thousand  total  items  and  a  default  pattern  length  (defined  in  [9]) 
of  4.  Thus,  the  size  of  the  dataset  is  10  times  the  number  of  transactions.  As  the  number  of 
transactions  is  increased  from  10K  to  3000K  the  time  taken  increases  proportionately.  The 
largest  frequent  itemset  was  5  long.  This  explains  the  five  fold  difference  in  performance 
between  the  Stored-procedure  and  the  Cache-Mine  approach.  Figure  4.9  shows  the  scaling 
when  the  transaction  length  changes  from  3  to  50  while  keeping  the  number  of  transactions 
fixed  at  100K.  All  three  approaches  scale  linearly  with  increasing  transaction  length. 
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Figure  4.8.  Scale-up  with  increasing  number  of  transactions 
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Figure  4.9.  Scale-up  with  increasing  transaction  length 

4.6.3     Impact  of  longer  names 

In  these  experiments  we  assumed  that  the  tids  and  item-ids  are  all  integers.  Often  in 
practice  these  are  character  strings  longer  than  four  characters.  Longer  strings  need  more 
storage  and  cost  more  during  comparisons.  This  could  hurt  all  four  of  the  alternatives. 
For  the  Stored-procedure,  UDF  and  Cache-Mine  approach  the  time  taken  to  transfer  data 
will  increase.  The  Intelligent  Miner  code  [71]  maps  all  character  fields  to  integers  using  an 
in-memory  hash-table.  Therefore,  beyond  the  increase  in  the  data  transfer  and  mapping 
costs  (which  accounts  for  the  bulk  of  the  time),  we  do  not  expect  the  processing  time  of 
these  three  alternatives  to  increase.  For  the  SQL  approach  we  cannot  assume  an  in-memory 
hash-table  for  doing  the  mapping  therefore  we  use  an  alternative  approach  based  on  table 
functions. 

For  SQL  approach  we  discuss  the  hybrid  approach.  The  two  (already  expensive)  steps 
that  could  suffer  because  of  longer  names  are  (1)  final  group- bys  during  pass  2  or  higher 
when  the  Gather  Join  approach  is  chosen  and  (2)  tid-list  operations  when  the  Vertical 
approach  is  chosen.  For  efficient  performance,  the  first  step  requires  a  mapping  of  item-ids 
and  the  second  one  requires  us  to  map  tids.   We  use  a  table  function  to  map  the  tids  to 
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unique  integers  efficiently  in  one  pass  and  without  making  extra  copies.  The  input  to  the 
table  function  is  the  data  table  in  the  tid  order.  The  table  function  remembers  the  previous 
tid  and  the  maintains  a  counter.  Every  time  the  tid  changes,  the  counter  is  incremented. 
This  counter  value  is  the  mapping  assigned  to  each  tid.  We  need  to  do  the  tid  mapping  only 
once  before  creating  the  TidTable  in  the  Vertical  approach  and  therefore  we  can  pipeline 
these  two  steps.  The  item  mapping  is  done  slightly  differently.  After  the  first  pass,  we  add 
a  column  to  F\  containing  a  unique  integer  for  each  item.  We  do  the  same  for  the  TidTable. 
The  Gather  Join  approach  already  joins  the  data  table  T  with  Fi  before  passing  to  table 
function  Gather.  Therefore,  we  can  pass  to  Gather  the  integer  mappings  of  each  item  from 
Fi  instead  of  its  original  character  representation.  After  these  two  transformations,  the  tid 
and  item  fields  are  integers  for  all  the  remaining  queries  including  candidate  generation  and 
rule  generation.  By  mapping  the  fields  this  way,  we  expect  longer  names  to  have  similar 
performance  impact  on  all  of  our  architectural  options. 

4.6.4     Space  Overhead  of  Different  Approaches 

In  Figure  4.10  we  summarize  the  space  required  for  different  datasets  for  three  options: 
Stored-procedure,  Cache-Mine  and  SQL.  For  these  experiments  we  assume  that  the  tids 
and  items  are  integers.  The  first  part  refers  to  the  space  used  in  caching  data  and  the 
second  part  refers  to  any  temporary  space  used  by  the  DBMS  for  sorting  or  alternately 
for  constructing  indices  to  be  used  during  sorting.  The  size  of  the  data  is  the  same  as  the 
space  utilization  of  the  Stored-procedure  approach.  The  space  requirements  for  UDF  is 
the  same  as  that  for  Stored-procedurewhich  requires  less  space  than  the  Cache-Mine  and 
SQL  approaches.  The  Cache-Mine  and  SQL  approaches  have  comparable  storage  overheads. 
For  Stored-procedure  and  UDF  we  do  not  need  any  extra  storage  for  caching.   However, 
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all  three  options  Cache-Mine,  Stored-procedure  and  UDF  require  data  in  each  pass  to  be 
grouped  on  the  tid.  In  a  relational  DBMS  we  cannot  assume  any  order  on  the  physical 
layout  of  a  table,  unlike  in  a  file  system.  Therefore,  we  need  either  an  index  on  the  data 
table  or  need  to  sort  the  table  every  time  to  ensure  a  particular  order.  Let  R  denote  the 
total  number  of  (tid,item)  pairs  in  the  data  table.  Either  option  has  a  space  overhead  of 
2  x  R  integers.  The  Cache-Mine  approach  caches  the  data  in  an  alternative  binary  format 
where  each  tid  is  followed  by  all  the  items  it  contains.  Thus,  the  size  of  the  cached  data 
in  Cache-Mine  is  at  most:  R  +  T  integers  where  T  is  the  number  of  transactions.  For  SQL 
we  use  the  hybrid  Vertical  option.  This  requires  creation  of  an  initial  TidTable  of  size  at 
most  I  +  R  where  I  is  the  number  of  items.  Note  that  this  is  slightly  less  than  the  cache 
required  by  the  Cache-Mine  approach.  The  SQL  approach  needs  to  sort  data  in  pass  1  in 
all  cases  and  pass  2  in  some  cases  where  we  used  the  Gather  Join  approach  instead  of  the 
Vertical  approach.  This  explains  the  large  space  requirement  for  Dataset-B.  However,  in 
practice  when  the  item-ids  or  tids  are  character  strings  instead  of  integers,  the  extra  space 
needed  by  Cache -Mine  and  SQL  is  a  much  smaller  fraction  of  the  total  data  size  because 
before  caching  we  always  convert  item-ids  to  their  compact  integer  representation  and  store 
in  binary  format. 

4.7     Summary  of  Comparison  Between  Different  Architectures 

In  Table  4.1  we  present  a  summary  of  the  pros  and  cons  of  the  different  architectures 
by  ranking  them  on  a  scale  of  1  (good)  to  4  (bad)  on  each  of  the  following  yardsticks:  (a) 
performance  (execution  time);  (b)  storage  overhead;  (c)  scope  for  automatic  parallelization; 
(d)  development  and  maintenance  ease;  (e)  portability  (f)  inter-operability. 
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Figure  4.10.  Comparison  of  different  architectures  on  space  requirements. 

Table  4.1.  Pros  and  cons  of  different  architectural  options  ranked  on  a  scale  of  l(good)  to 
4(bad) 


Metric 

Stored-proc. 

UDF 

Cache-Mine 

SQL 

Performance 

4 

3 

1 

2 

Storage  overhead 

1 

1 

2 

2-3 

Automatic  Parallelism 

2 

2 

2 

1(?) 

Development  and  maintenance  ease 

2 

3 

2 

1-2 

Portability 

1 

3 

1 

2 

Inter-operability 

2 

2 

2 

1(?) 

In  terms  of  performance,  the  Cache-Mine  approach  is  the  best  option  followed  by  the 
SQL  approach.  The  SQL  approach  was  within  a  factor  of  0.8  to  2  of  the  Cache-Mine  ap- 
proach for  all  of  our  experiments.  The  UDF  approach  is  better  than  the  Stored-procedure 
approach  in  performance  by  30  to  50%  but  it  looses  on  the  metrics  of  development  and 
maintenance  costs  and  portability.  In  terms  of  space  requirements,  the  Cache-Mine  and 
the  SQL  approach  loose  to  the  UDF  or  the  Stored-procedure  approach.  Between  the 
Stored-procedure  and  the  Cache-Mine  implementation,  the  performance  difference  is  ex- 
actly a  function  of  the  number  of  passes  made  on  the  data;  that  is,  if  we  make  four  passes 
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of  the  data  the  Stored-procedure  approach  is  four  times  slower  than  Cache-Mine.  There- 
fore, if  one  is  not  willing  to  pay  the  penalty  of  extra  storage  the  best  strategy  for  improving 
performance  is  to  reduce  the  number  of  passes  to  the  data  even  if  it  comes  at  the  cost 
of  extra  processing.  Some  of  the  recent  proposals  [26,  127]  that  attempt  to  minimize  the 
number  of  data  passes  to  2  or  3  might  be  useful  in  that  regard. 

The  SQL  approach  is  not  the  winner  in  terms  of  performance  and  space  requirements 
but  it  is  competitive.  The  benefit  of  the  SQL  approach  could  arise  from  other  secondary 
advantages. 

The  SQL  implementation  has  the  potential  for  automatic  parallelization.  Paralleliza- 
tion  could  come  for  free  because  bulk  of  our  processing  is  expressed  in  terms  of  standard 
SQL  queries.  As  long  as  the  database  supports  efficient  parallelization  of  these  queries  the 
mining  code  can  be  easily  parallelized.  The  problem  case  is  where  the  UDFs  use  scratch 
pads.  The  only  such  function  in  our  queries  is  the  Gather  table  function.  This  function 
essentially  implements  a  user  defined  aggregate,  and  would  have  been  easy  to  parallelize  if 
the  DBMS  provided  support  for  user  defined  aggregates  or  allowed  explicit  control  from  the 
application  about  how  to  partition  the  data  amongst  different  parallel  instances  of  the  func- 
tion. For  MPPs,  one  could  rely  on  the  DBMS  to  come  up  with  a  data  partitioning  strategy. 
However,  it  might  be  possible  to  better  tune  performance  if  the  application  could  provide 
hints  about  the  best  partitioning  [11]  to  use.  Further  experiments  are  required  to  assess  how 
the  performance  of  these  automatic  parallelizations  would  compare  with  algorithm-specific 
parallelizations  [11]. 

The  development  time  and  code  size  using  SQL  could  be  shorter  if  one  can  get  efficient 
implementations  out  of  expressing  the  mining  algorithms  declaratively  using  a  few  SQL 
statements.    Thus,  one  can  avoid  writing  and  debugging  code  for  memory  management, 
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indexing  and  space  management  all  of  which  are  already  provided  in  a  database  system. 
However,  there  are  some  detractors  to  easy  development  using  the  SQL  alternative.  First, 
any  attached  UDF  code  will  be  harder  to  debug  than  stand-alone  C++  codes  due  to  lack 
of  debugging  tools.  Second,  stand-alone  code  can  be  debugged  and  tested  faster  when  run 
against  flat  file  data.  Running  against  flat  files  is  typically  a  factor  of  five  to  ten  faster 
compared  to  running  against  data  stored  in  DBMS  tables.  Finally,  some  mining  algorithms 
(for  instance,  neural-net  based)  might  be  too  awkward  to  express  in  SQL. 

The  ease  of  porting  of  the  SQL  alternative  depends  on  the  kind  of  SQL  used.  Within  the 
same  DBMS,  porting  from  one  OS  platform  to  another  requires  porting  only  the  small  UDF 
code  and  hence  is  easy.  In  contrast  the  Stored-procedure  and  Cache-Mine  alternatives 
require  porting  larger  lines  of  code.  Porting  from  one  DBMS  to  another  could  get  hard  for 
SQL  approach,  if  non-standard  DBMS-specific  features  are  used.  Unfortunately,  we  found 
SQL-92  implementations  (which  would  have  been  quite  portable)  to  be  unacceptable  from 
the  performance  viewpoint.  Our  preferred  SQL  implementation  relies  on  the  availability 
of  DB2's  table  functions.  Table  functions,  for  example,  in  Oracle  8  do  not  have  the  same 
interface  and  semantics  as  DB2.  Also,  if  different  features  have  different  performance 
characteristics  on  different  database  systems,  considerable  tuning  would  be  required.  In 
contrast,  the  Stored-procedure  and  Cache-Mine  approach  are  not  tied  thus  to  any  DBMS 
specific  features.  The  UDF  implementation  has  the  worst  of  both  worlds;  it  is  large  and  is 
tied  to  a  DBMS. 

The  biggest  attraction  of  the  SQL  implementation  is  inter-operability  and  usage  flex- 
ibility. The  ad  hoc  querying  support  provided  by  the  DBMS  enables  flexible  usage  and 
exposes  potential  for  pipelining  the  input  and  output  operators  of  the  mining  process  with 
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other  operators  in  the  DBMS.  However,  to  exploit  this  feature  one  either  needs  to  imple- 
ment the  mining  operators  inside  the  DBMS  or  use  table  functions  in  novel  ways.  The  first 
alternative  would  require  major  rework  in  existing  database  systems.  The  second  alter- 
native requires  table  functions  that  can  execute  SQL  statements  (a  facility  not  currently 
available  in  DB2  UDB).  Furthermore,  some  mining  operators  generate  multiple  different 
kinds  of  output  (for  instance,  model  tree  and  statistics  for  decision  trees).  In  such  cases, 
pipelining  of  mining  operators  becomes  harder  to  fit  in  existing  relational  models.  Note 
that  the  SQL  approach  presented  here  is  based  on  embedded  SQL  and  cannot  provide  oper- 
ator pipelining  and  inter-operability.  Queries  on  the  mined  result  is  possible  with  all  four 
alternatives  as  long  as  the  mined  results  are  stored  back  in  the  DBMS. 

4.8     Other  Associations  Algorithms 

There  are  recent  proposals  [26,  127]  that  attempt  to  minimize  the  number  of  data 
passes  in  the  Apriori  algorithm  to  two  or  three.  Currently,  the  Stored-procedure  approach 
makes  as  many  passes  as  the  length  of  the  largest  itemset  and  in  each  pass,  most  of  the 
time  is  spent  in  reading  the  data  from  the  DBMS.  Therefore,  by  using  these  alternative 
algorithms  it  is  possible  to  reduce  the  time  spent  by  Stored-procedure  by  as  many  times 
as  the  number  of  passes.  However,  the  Cache-Mine  approach  is  almost  a  lower  bound  on  the 
best  performance  possible  for  doing  associations  outside  the  DBMS  because  it  spends  most 
of  its  time  in  the  first  pass  which  is  the  minimum  amount  of  work  any  mining  algorithm 
that  works  outside  the  DBMS  has  to  spend;  read  all  of  the  data  at  least  once.  Since  our  SQL 
approach  is  competitive  with  the  one-pass  Cache-Mine  approach,  we  expect  our  conclusions 
to  hold  for  other  algorithms  that  spend  1.5  to  3  times  the  time  spent  by  Cache-Mine. 
These  alternative  algorithms  are  not  likely  to  help  improve  the  performance  of  our  SQL 
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implementation  because  they  are  aimed  at  reducing  the  number  of  passes  over  the  data. 
We  have  seen  that  with  our  SQL  implementations  hardly  any  time  is  spent  on  passes  beyond 
2  because  of  the  Vertical  approach.  The  time  spent  in  pass  2  is  not  reduced  because  they 
all  require  counting  support  of  all  of  C2  on  the  entire  data  anyway. 

4.9     Summary 

In  this  chapter,  we  developed  and  experimented  with  a  set  of  approaches  that  made 
use  of  the  new  object-relational  extensions  like  UDFs,  BLOBs  and  table  functions.  These 
approaches  performed  much  better  than  the  SQL-92  approaches  in  Chapter  3.  We  developed 
a  hybrid  scheme  which  picks  the  best  approach  for  each  pass  based  on  the  cost  estimates.  We 
also  compare  the  various  architectural  alternatives  both  qualitatively  and  quantitatively. 

Note  A  part  of  the  work  described  in  this  chapter  was  primarily  done  by  researchers 
from  IBM  Almaden  Research  Center.  The  author's  primary  contributions  were  the  opti- 
mizations to  the  approaches,  in  Sections  4.1.1  and  4.2.1,  the  hybrid  approach  in  Section  4.5 
and  the  cost  analysis  of  the  various  support  counting  approaches.  The  author  has  also 
contributed  to  the  performance  experiments  which  led  to  the  various  comparisons. 


CHAPTER  5 
GENERALIZED  ASSOCIATION  RULES 


In  most  real-life  applications  the  set  of  items  that  appear  in  transactions  can  be  cat- 
egorized according  to  a  taxonomy  (is-a  hierarchy)  on  the  items.  The  taxonomy  shown  in 
Figure  5.1  says  that  Pepsi  is-a  soft  drink  is-a  beverage  and  so  on.  In  general,  a  taxonomy 
can  be  represented  as  a  directed  acyclic  graph  (DAG).  Given  a  set  of  transactions  Teach  of 
which  is  a  set  of  items,  and  a  taxonomy  Tax,  the  problem  of  mining  generalized  association 
rules  is  to  discover  all  rules  of  the  form  X  ->  Y,  with  the  user-specified  minimum  support 
and  minimum  confidence.  X  and  Y  can  be  sets  of  items  at  any  level  of  the  taxonomy,  such 
that  no  item  in  Y  is  an  ancestor  of  any  item  in  X  [118].  For  example,  there  might  be  a 
rule  which  says  that  "50%  of  transactions  that  contain  Soft  drinks  also  contain  Snacks;  5% 

of  all  transactions  contain  both  these  items." 

Beverages  Snacks 

/\  /\ 

Soft  drinks     Alcoholic  drinks      Pretzels     Chocolate  bar 

/\       \ 

Pepsi  Coke  Beer 

Figure  5.1.  Example  of  a  taxonomy 

In  this  chapter,  we  present  several  SQL  formulations  of  generalized  association  rule 
mining  [126].  In  Section  5.1,  we  describe  the  input-output  formats  and  in  Section  5.2, 
we  briefly  outline  the  Cumulate  algorithm  for  generalized  association  rule  mining  [118]. 
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Table  5.1.  An  example  of  the  taxonomy  table 


Parent 

Child 

Beverages 

Soft  drinks 

Beverages 

Alcoholic  drinks 

Soft  drinks 

Pepsi 

Soft  drinks 

Coke 

Alcoholic  drinks 

Beer 

Snacks 

Pretzels 

Snacks 

Chocolate  bar 

We  present  SQL  formulations  of  ancestor  pre-computation  and  candidate  generation  in 
Sections  5.3  and  5.4  respectively.  Several  approaches  for  support  counting  based  on  SQL- 
92  and  SQL-OR  are  presented  in  Sections  5.6  and  5.7.  In  Section  5.8,  we  discuss  some 
experimental  results. 

5.1     Input-Output  Formats 

Input  format  The  transaction  table  T  has  the  same  schema,  (tid,  item)  as  in  the 
case  of  boolean  association  rules.  While  transactions  normally  contain  only  items  that  are 
leaves  in  the  taxonomy,  it  is  not  a  requirement.  The  taxonomy  table  Tax  has  two  column 
attributes,  parent  and  child.  Each  record  in  Tax  corresponds  to  an  edge  in  the  taxonomy 
DAG.  The  taxonomy  shown  in  Figure  5.1  is  represented  by  Table  5.1.  The  number  of 
children  for  the  interior  nodes  and  the  depth  of  the  DAG  are  unknown  at  table  creation 
time  and  therefore  alternative  schemas  may  not  be  convenient. 


Output  format  The  output  is  a  collection  of  all  rules  that  satisfy  the  user-specified 
minimum  support  and  minimum  confidence.  The  schema  of  the  rules  is  the  same  as  in  the 
boolean  associations  case  3.3. 
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5.2     Cumulate  Algorithm 

Srikant  and  Agrawal  [118]  present  several  algorithms:  Basic,  Cumulate,  Stratify  and 
EstMerge.  We  picked  Cumulate  for  our  SQL-formulations  since  it  has  the  best  performance. 
Stratify  and  EstMerge  are  sometimes  marginally  better  but  they  are  far  too  complicated 
to  merit  the  additional  development  cost.  Cumulate  has  the  same  basic  structure  as  the 
Apriori  algorithm  [9]  for  boolean  associations.  It  makes  multiple  passes  over  the  data  where 
in  the  kth  pass  it  finds  frequent  itemsets  of  size  k.  Each  pass  consists  of  two  phases:  In  the 
candidate  generation  phase,  the  frequent  (k  —  l)-itemsets,  F^-i,  found  in  the  previous  pass 
is  used  as  the  seed  set  to  generate  candidate  fc-itemsets  (Cjt)  that  are  potentially  frequent. 
In  the  support  counting  phase  for  each  itemset  t  G  Cfc,  the  number  of  extended  transactions 
(transactions  augmented  with  all  the  ancestors  of  its  items)  that  contains  t  is  counted.  At 
the  end  of  the  pass,  the  frequent  candidates  are  identified  yielding  Fjt-  The  algorithm 
terminates  when  Fk  or  Ck+i  becomes  empty. 

The  above  basic  structure  is  augmented  with  a  few  optimizations.  The  important  ones 
are  pruning  itemsets  containing  an  item  and  its  ancestor  and  pre-computing  the  ancestors 
for  each  item.  We  extend  the  SQL-based  boolean  association  rule  framework  in  Chapters  3 
and  4  with  these  optimizations. 

5.3     Pre-Computing  Ancestors 

In  this  section,  we  explain  how  to  compute  the  set  of  ancestors  of  each  item  using  SQL. 
We  call  x  an  ancestor  of  x  if  there  is  a  directed  path  from  x  to  x  in  Tax.  The  Ancestor 
table  is  primarily  used  for  (%)  pruning  candidates  containing  an  item  and  its  ancestor  and 
(ii)  extending  the  transactions  by  adding  all  the  ancestors  of  its  items.     The  ancestor 
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computation  can  be  expressed  in  SQL  using  a  recursive  query  as  shown  in  Figure  5.2.  The 
result  of  the  query  is  stored  in  table  Ancestor  having  the  schema  (ancestor,  descendant). 

insert  into  Ancestor 

with  R-Tax  (ancestor,  descendant)  as 

(select  parent,  child  from  Tax 
union  all 

select    p. ancestor,  c. child  from  R-Tax  p,  Tax  c 

where   p. descendant  =  c. parent) 
select  ancestor,  descendant  from  R-Tax 


ancestor,  descendant 


Figure  5.2.  Pre-computing  ancestors 

5.4     Candidate  Generation 

In  the  candidate  generation  phase,  we  use  the  frequent  itemsets  Fk-i  found  in  the 
(k  —  l)th  pass  to  generate  a  set  of  candidate  itemsets  Ck  that  contains  all  k  itemsets  such 
that  all  k  of  its  (k  -  l)-length  subsets  are  in  Fk-\.  Section  3.4.1  shows  how  to  express  this 
operation  as  a  fc-way  join  between  the  frequent  (k  —  l)-itemsets  (Ffc_i's).  We  can  use  the 
same  formulation  except  that  we  need  to  prune  from  Ck  itemsets  containing  an  item  and 
its  ancestor.   Srikant  and  Agrawal  [118]  prove  that  this  pruning  needs  to  be  done  only  in 
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the  second  pass  (for  C2).    In  the  SQL  formulation  as  shown  in  Figure  5.3,  we  prune  all 
(ancestor,  descendant)  pairs  from  C2  which  is  generated  by  joining  Fi  with  itself. 

insert  into  C2 

(select  I\.item\,I2.item\  from  FiIi,FxI2 

where   I\.item\  <  I2.item\) 

except 
(select  ancestor,  descendant  from  Ancestor  union 
select    descendant,  ancestor  from  Ancestor) 

C2 

t 

EXCEPT 


II. item  1  <I2.iteml     JXTl  .  UNION 

l^T    \J       ancestor,       U1,1U"     descendant, 
descendant/         \  ancestor 


Fl  II  Fl  12      Ancestor  Ancestor 

Figure  5.3.  Generation  of  C2 

5J> Support  Counting  to  Find  Frequent  Itemsets 

We  use  the  candidate  itemsets  Ck,  the  data  table  T  and  the  ancestor  table  Ancestor 
to  count  the  support  of  the  itemsets  in  Ck.  We  consider  two  categories  of  SQL  imple- 
mentations based  on  SQL-92  and  SQL-OR  in  Sections  5.6  and  5.7  respectively.  All  the 
SQL  approaches  developed  for  boolean  associations  in  Chapters  3  and  4  can  be  extended 
to  handle  taxonomies.  However,  we  present  the  extensions  to  only  a  few  representative 
approaches.  In  particular,  we  consider  the  KwayJoin  approach  from  SQL-92  and  Vertical 
and  Gather  Join  from  SQL-OR. 


!)4 


5.6     Support  Counting  Using  SQL-92 
5.6.1     K-way  Join 

In  each  pass  k,  we  join  the  candidate  itemsets  Ck  with  k  copies  of  an  extended  trans- 
action table  T*  (defined  below),  and  follow  it  up  with  a  group  by  on  the  itemsets. 

The  extended  transaction  table  T*  is  obtained  by  augmenting  T  to  include  tid,  item 
entries  for  all  ancestors  of  items  appearing  in  T.  This  can  be  formulated  as  a  SQL  query 
as  shown  in  Figure  5.4. 

Query  to  generate   T* 

select  item,  tid  from  T  union 

select  distinct  A. ancestor  as  item,  T.tid 

from  T,  Ancestor  A 

where  A. descendant  =  T.item) 

T* 

t 

UNION 
T.tid,  A. ancestor  as  item^^^        ^^-^^ 

T.item  =  A. descendant    L^^-CJ  T 


T  Ancestor  A 

Figure  5.4.  Transaction  extension  subquery 

The  select  distinct  clause  is  used  to  eliminate  duplicate  records  due  to  extension  of 
items  with  a  common  ancestor.  Note  that  for  this  approach  we  do  not  materialize  T*. 
Instead  we  use  the  SQL  support  for  common  subexpressions  (with  construct)  to  pipeline 
the  generation  of  T*  with  the  join  operations.  The  final  SQL  query  and  the  corresponding 
tree  diagram  are  shown  in  Figure  5.5. 
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insert  into  Fk 

with  T*(tid,  item)  as  (Query  for  T*  defined  above) 

select  itemi,  . . .  itemk,  count(*) 

from         Ck,T*  tu...T*  tk 

where       ^i.item  =  Ck-item\  and 

fjt-item  —  Ck-iterrik  and 
ii.tid  =  <2-tid  and 

£jt_i.tid  =  ijfc.tid 

group  by  itemi,item2  ...iterrik 
having     count(*)  >  :minsup 


having 
count(*)  >  :minsup 


Group  by 
iteml itemk 


Ck.iteml  =  tl.item 


Ck. itemk  =  tk.item 


tl.nd  =  tk.tid     [^><C] 


tl.tid  =  t2.tid       P><3 


T*    tk 


T*    tl 


T*    t2 


Figure  5.5.  Support  counting  by  K-way  join 
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The  3wayJoin  and  2GroupBy  approaches  for  boolean  associations  can  be  extended  to 
handle  taxonomies,  in  a  similar  way  by  replacing  T  with  T*  in  the  corresponding  queries. 

5.6.2     Subquery  Optimization 

The  basic  KwayJoin  approach  can  be  optimized  to  make  use  of  common  prefixes  be- 
tween the  itemsets  in  Ck  by  splitting  the  support  counting  phase  into  a  cascade  of  k  sub- 
queries.  The  subqueries  in  this  case  are  exactly  similar  to  those  for  boolean  associations 
presented  in  3.5.4  except  for  the  use  of  T*  instead  of  T. 

5.7     Support  Counting  Using  SQL-OR 

In  this  section  we  present  three  approaches  that  make  use  of  the  object-relational 
features  of  SQL.  We  also  present  a  cost-based  analysis  of  the  execution  time  of  the  various 
approaches. 

5.7.1     Gather  Join 

The  Gather  Join  approach,  which  is  based  on  the  use  of  table  functions  [80],  generates 
all  possible  fc-item  combinations  of  extended  transactions,  joins  them  with  the  candidate 
table  Ck  and  counts  the  support  by  grouping  the  join  result.  The  extended  transactions 
T*  (defined  in  Section  5.6.1)  are  passed  to  the  table  function  GatherComb-K  in  the  (tid, 
item)  order.  A  record  output  by  the  table  function  is  a  fc-item  combination  supported  by 
a  transaction  and  has  k  attributes  TJtmu. . .  ,TJtmk.  SQL  queries  for  this  approach  is 
presented  in  Figure  5.6. 

The  special  optimization  for  pass  2  and  the  variations  of  the  Gather  Join  approach, 
namely  GatherCount,  GatherPrune  and  Horizontal  (refer  Section  4.1)  are  also  applicable 
here. 
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insert  into  Fk  select  item\, . . .  ,item,k,  count(*) 

from  Cfc, 

(select  t2.TJtmi, . . . ,  t2-TJtmk  from  T*  t\, 
table  (GatherComb-K(^.tid,  fi.item))  as  £2) 

where  t^-T j,tm\  =  Ck-item\  and 

t2-TJtrrik  =  Ck-iterrik 
group  by  Ck.itemi, . . . ,  Ck .iterrik 
having  count(*)  >  rminsup 


having 
count(*)  >  :minsup 

t 

Group  by 

iteml , ,itemk: 

T_itml  =  Ck.itemi     ^^^  T 

T_itmk  =  Ck.itemk     ^^ 


Table  function 

GatherComb-K 

♦ 
Order  by 

tid,  item 


T* 


Ck 


Figure  5.6.  Support  counting  by  Gather  Join 
5.7.2     GatherExtend 

This  is  a  variation  of  the  Gather  Join  approach  where  we  push  the  transaction  exten- 
sion inside  the  table  function.  For  each  item,  we  create  an  ancestor-list  (stored  as  a  BLOB) 
which  contains  a  list  of  ancestors  of  that  item.  This  can  be  accomplished  using  similar  SQL 
queries  as  in  the  tid- list  creation  phase  of  the  Vertical  approach  for  boolean  associations 
(refer  Section  4.2).  The  (item,  ancestor-list)  pairs  are  stored  in  a  new  AncListTable.  Note 
that  ancestor-list  needs  to  be  created  only  for  items  that  appear  in  the  input  transactions. 
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In  the  support  counting  phase,  we  join  the  transaction  table  T  with  the  AncListTable  and 
the  resulting  (tid,  item,  ancestor-list)  tuples  are  passed  to  a  table  function  GExtendComb-K 
in  tid  order.  The  table  function  collects  all  the  items  and  the  corresponding  ancestor-lists 
of  the  same  transaction,  extends  it  and  outputs  k-item  combinations  of  the  extended  trans- 
action. Figure  5.7  illustrates  the  SQL  queries  for  this  approach.  This  approach  was  not 
implemented  because  it  might  not  perform  better  than  the  Vertical  approach  explained 
next. 

This  approach  can  be  combined  with  the  GatherPrune  approach  (refer  Section  4.1 
to  prune  out  item  combinations  that  are  not  candidates.  In  that  case,  we  can  also  delete 
from  the  extended  transactions  items  that  are  not  present  in  any  of  the  candidates,  before 
generating  the  item  combinations. 

5.7.3     Cost  Analysis 

In  the  cost  analysis  we  use  the  notations  of  Table  5.2  in  addition  to  the  notations 
introduced  for  boolean  associations  in  Table  3.2. 

Table  5.2.  Additional  notations  used  for  cost  analysis 


number  of  records  in  the  input  taxonomy  table 

average  depth  of  the  taxonomy 

number  of  leaf  nodes  in  the  taxonomy 

number  of  records  in  the  Ancestor  table  fa  I  *  d 

average  number  of  items  in  an  extended  transaction  =  N  *d 

total  number  of  records  after  transaction  extension  =  R*  d 

cost  of  generating  a  k-item  combination  using  GExtendComb-K 

cost  of  sorting  n  records 


The  cost  of  Gather  Join  includes  the  cost  of  extending  the  transactions  by  joining  the 
transaction  table  with  the  ancestor  table,  generating  A;-item  combinations,  joining  them 
with  Ck  and  grouping  the  join  result  to  count  the  support.    Note  that  the  transaction 
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insert  into  Fk  select  itemi, . . .  ,item,k,  count(*) 
from  Cfc,  (select  t2.TJ.tm1, . . . ,  t2.TA.tmk 

from      (select  tid,  item,  ancestor-list 
from     T,  AncListTable  A 
where  T.item  =  A. item)  as  t\, 
table  (GExtendComb-K(  ii.tid,  ii.item,  t\. ancestor- list))  as  £2) 
where  t2-TJtmi  =  Ck-itemi  and 

t2.Tl.tmk  =  Ck-itemk 
group  by  Ck-itemi, . . . ,  Ck-itemk 
having  count(*)  >  :minsup 


T_itml  =Ck.iteml 
T_itmk  =  Ck.itemk 


having 
count(*)  >  :minsup 

t 
Group  by 

iteml,....,itemk 

t 

IX 


Table  function 
GExtendComb-K 

T.tid,  T.item,  A. ancestor-list     T 

T.item  =  A. item   [^XCJ 


Ck 


AncListTable  A 


Figure  5.7.  Support  counting  by  GatherExtend 
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extension  cost  can  be  ignored  if  T*  is  materialized  because  in  that  case,  it  will  be  a  one- 
time cost.  The  result  size  of  the  join  which  extends  the  transactions  is  R*  and  the  number 
of  fc-item  combinations  generated,  Tk  is  C(N*,k)  *  T.  Therefore,  the  total  cost  of  the 
Gather  Join  approach  is 

join(iJ,  A,  R*)  +  order(iT )  +Tk*sk+  join(Tfc,  Ck,  S(Ck))  +  group(5(C*),  Ck) 

In  Gather/Extend,  transaction  extension  is  done  inside  the  table  function.  The  cost  of 
passing  the  ancestor-list  as  a  BLOB  is  blob(d)  since  the  average  length  of  the  ancestor-list 
is  the  same  as  the  depth  of  the  taxonomy.  The  cost  of  GatherExtend  is 

jointi?, l,R)  +  R*  blob(d)  +  Tfc  *  ek  +  join(Tfc, Ck, S{Ck))  +  group(S(Ck), Ck) 

5.7.4     Vertical 

In  this  approach,  the  transactions  are  first  converted  into  a  vertical  format  by  creating 
for  each  item  a  BLOB  (tid-list)  containing  all  tids  that  contain  that  item.  The  support  for 
each  itemset  is  counted  by  merging  the  tid-lists  of  all  its  items.  The  tid-list  of  leaf  node  items 
can  be  created  using  a  table  function  in  the  same  way  as  in  the  boolean  associations  case. 
We  present  two  approaches  for  creating  the  tid-list  of  the  interior  nodes  in  the  taxonomy 
DAG. 

The  first  approach  is  based  on  doing  the  union  of  the  descendant's  tid-lists  of  an 
interior  node.  In  the  first  phase,  we  create  an  initial  TidTable  containing  the  tid-lists  of 
the  leaf  nodes.  The  TidTable  is  joined  with  the  ancestor  table  and  the  join  result  is  passed 
in  the  order  of  the  ancestor  attribute  to  the  table  function  TUnion.  For  every  node  x,  the 
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insert  into  TidTable  select  t2.item,  ^-tid-list 
from  (select  A. ancestor,  T.tid-list 

from     TidTable  T,  Ancestor  A 

where  T.item  =  A. descendant 

order  by  ancestor)  as  t\, 

table(TUnion(fi. ancestor,  ii. tid-list))  as  t2 

t Litem,  tl.tid-list 

t 
Table  function 

TUnion 

t 
Order  by 

ancestor 
A. ancestor,  T.tid-list 

T.item  =  A. descendant     L^^^T] 


TidTable   T  Ancestor   A 


Figure  5.8.  Interior  nodes'  tid-list  generation  by  union 

table  function  collects  the  tid-lists  of  all  the  leaf  nodes  reachable  from  x,  union  them  and 
outputs  the  tid-list  for  x.  The  SQL  query  for  this  is  illustrated  in  Figure  5.8.  This  approach 
can  be  further  optimized  to  do  the  tid-list  union  of  only  the  children  of  each  interior  node. 
However,  in  the  first  phase  tid-lists  have  to  be  created  and  materialized  for  every  leaf  node 
item  irrespective  of  its  support.  This  is  very  expensive  especially  when  the  number  of  items 
is  large. 

The  second  approach  is  to  pass  T*  (defined  in  Section  5.6.1)  to  the  Gather  table 
function,  as  shown  in  Figure  5.9.  The  table  function  outputs  tid-lists  for  all  the  items  in 
the  taxonomy. 

The  support  counting  queries  are  exactly  the  same  as  for  boolean  associations  ex- 
plained in  Section  4.2.  The  cost  formula  is  also  the  same  as  in  boolean  associations  except 
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insert  into  TidTable  select  t2-iteva,  ^2-tid-list 

from  T*  t\,   table(Gather(ii.item,  ij.tid,  :minsup))  as  <2 


t2.item,  t2.tid-list 

t 
Table  function 

Gather 

t 
Order  by 

item,  tid 

1 

T* 


Figure  5.9.  Interior  nodes'  tid-list  generation  from  T* 

p* 
that  the  average  length  of  a  tid-list  in  this  case  is  -pf-,  where  FC,  is  the  sum  of  support 

counts  of  all  the  frequent  items. 

5.8     Performance  Results 

In  this  section,  we  report  the  results  of  some  of  our  performance  experiments  on  real- 
life  datasets.  We  present  only  one  set  of  results  since  our  emphasis  here  is  on  showing  that 
the  boolean  association  rule  framework  can  be  easily  extended  to  handle  more  complex 
mining  computations  like  generalized  association  rules. 

The  experiments  reported  here  were  performed  on  Version  5  of  IBM  DB2  Universal 
Server  installed  on  a  RS/600  Model  140  with  a  200MHz  CPU,  256  MB  main  memory  and  a 
9  GB  disk  with  a  measured  transfer  rate  of  8  MB/sec.  Figure  5.10  shows  the  performance  of 
three  approaches:  Vertical,  GatherJoin  (shown  as  Gcj)  and  Stored-procedure  (shown  as 
Sproc).  The  chart  shows  the  preprocessing  time  and  the  time  taken  for  the  different  passes. 
For  Vertical  the  preprocessing  time  includes  ancestor  pre-computation  and  tid-list  cre- 
ation times,  where  as  for  GatherJoin  it  is  just  the  time  for  ancestor  pre-computation.  In  the 
stored  procedure  approach,  the  mining  algorithm  is  encapsulated  as  a  stored  procedure  [32] 
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Performance  comparison 


3  Prep  ■  Pass  1  □  Pass  2  D  Pass  3  ■  Pass  4  B  Pass  5  ■  Pass  6 
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Mail  order  data: 

Total  number  of  records  =  2.5  million 

Number  of  transactions  =  568000 

Number  of  items  (leaf  nodes  in  taxonomy  DAG)  =  85161 

Total  number  of  items  (including  interior  nodes)  =  228428 

Max.  depth  of  the  taxonomy  =  7 

Avg.  number  of  children  per  node  =  1.6 

Max.  number  of  parents  =  3 


Figure  5.10.  Comparison  of  different  SQL  approaches 

which  runs  in  the  same  address  space  as  the  DBMS.  For  the  Stored-procedure  experi- 
ment, we  used  the  generalized  association  rule  implementation  provided  with  the  IBM  data 
mining  product,  Intelligent  Miner  [71].  For  all  the  support  values,  the  Vertical  approach 
performs  equally  well  as  the  Stored-procedure  approach.  In  some  of  the  experiments 
on  other  datasets,  the  Vertical  approach  performed  better  than  the  Stored-procedure 
approach.  The  Gather  Join  approach  is  worse  mainly  due  to  the  large  number  of  item  com- 
binations generated.  In  the  GatherJoin  approach,  the  extended  transactions  are  passed 
to  the  GatherComb  table  function  and  hence  the  effective  number  of  items  per  transaction 
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gets  multiplied  by  the  average  depth  of  the  taxonomy.  In  the  Gather  Join  approach,  we 
show  the  performance  numbers  for  only  the  second  pass.  Note  that  just  the  time  for  second 
pass  is  an  order  of  magnitude  more  than  the  total  time  for  all  the  passes  of  Vertical.  For 
0.5%  support,  second  pass  of  Gather  Join  took  over  20,000  seconds  while  the  total  time  for 
Vertical  was  only  about  3000  seconds.  We  did  not  do  extensive  experimentation  here 
because,  based  on  the  SQL  formulations  and  the  analysis,  we  expect  similar  performance 
observations  as  in  the  case  of  boolean  association  rules. 

5.9     Summary 

In  this  chapter,  we  presented  various  SQL  formulations  for  mining  generalized  associa- 
tion rules.  It  shows  that  the  boolean  association  rule  framework  can  be  easily  extended  for 
generalized  association  rules.  The  major  additions  were  to  "extend"  the  input  transaction 
table  (transform  T  to  T*)  and  to  pre-compute  the  ancestors. 

We  have  presented  here  the  extensions  only  for  a  few  representative  SQL  approaches 
for  boolean  association  rules  in  Chapters  3  and  4.  The  other  approaches  can  be  extended 
in  a  similar  manner.  We  expect  the  performance  trends  from  boolean  association  rules 
to  hold  for  all  these  approaches  for  generalized  associations  also.  There  could  be  better 
ways  of  handling  the  taxonomy  like  the  EstMerge  and  Stratify  algorithms  presented  in 
Srikant  and  Agrawal  [118].  However,  with  the  currently  available  SQL  operators  they  will 
be  too  awkward  to  program.  One  possibility  is  to  push  some  of  those  optimizations  inside 
user-defined  functions  and  it  remains  to  be  seen  how  they  will  influence  the  performance. 


CHAPTER  6 
SEQUENTIAL  PATTERNS 


Sequential  pattern  mining  utilizes  the  time  associated  with  the  transaction  data  to 
find  frequently  occurring  patterns.  A  sequential  pattern  is  an  ordered  list  (sequence)  of 
itemsets  (refer  Section  1.2.2  for  a  brief  description  of  sequential  patterns).  Given  a  set  of 
data-sequences  each  of  which  is  a  list  of  transactions  ordered  by  the  transaction  time,  the 
problem  of  mining  sequential  patterns  is  to  discover  all  such  patterns  with  a  user-specified 
minimum  support. 

We  present  several  SQL  formulations  of  sequential  pattern  mining  in  this  chapter.  In 
Section  6.1,  we  describe  the  input-output  formats  and  in  Section  6.2,  we  briefly  outline 
the  GSP  algorithm  for  sequential  pattern  mining  [120].  SQL-based  sequential  pattern 
candidate  generation  is  presented  in  Section  6.3.  In  Sections  6.5  and  6.6,  we  present  several 
SQL  formulations  for  support  counting  based  on  SQL-92  and  SQL-OR.  In  Section  6.7,  we 
briefly  mention  how  to  handle  taxonomies  over  items.  We  concentrate  on  showing  that 
sequential  pattern  mining  can  also  be  handled  within  the  same  SQL-based  framework.  We 
expect  the  performance  trends  from  association  rule  mining  to  hold  here  also. 

6.1     Input-Output  Formats 

6.1.1     Input  Format 

The  input  table  D  of  data-sequences  has  three  column  attributes:  sequence  identifier 
(sid),  transaction  time  (time)  and  item  identifier  (item).      Every  data-sequence  typically 
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contains  several  transactions  each  of  which  is  a  set  of  items.  The  data-sequence  table 
contains  multiple  rows  corresponding  to  different  items  that  belong  to  transactions  in  the 
data-sequence.  The  taxonomy  on  the  items  is  represented  in  the  same  way  as  for  the 
generalized  association  rules. 

6.1.2     Output  Format 

The  output  is  a  collection  of  frequent  sequences.  A  sequence  which  is  represented  as  a 
tuple  in  a  fixed-width  table  consists  of  an  ordered  list  of  elements  where  each  element  is  a  set 
of  items.  The  schema  of  the  frequent  sequences  table  is  (itemi,  enoi, . . . ,  itemk,enoh,len). 
The  len  attribute  gives  the  length  of  the  sequence,  that  is,  the  total  number  of  items  in 
all  the  elements  of  the  sequence.  The  eno  attributes  stores  the  element  number  of  the 
corresponding  items.  For  sequences  of  smaller  length,  the  extra  column  values  are  set  to 
NULL.  For  example,  if  k  =  4,  the  sequence  ((computer,  modem)  (printer))  is  represented 
by  the  tuple  (computer,  1,  modem,  1,  printer,  2,  NULL,  NULL,  3). 

6.2     GSP  Algorithm 

We  use  the  GSP  algorithm  of  Srikant  and  Agrawal  [120]  as  the  basis  for  our  SQL 
formulations.  The  basic  structure  of  the  GSP  algorithm  is  similar  to  that  of  the  apriori 
algorithm  for  association  rule  mining  [9],  although  the  specific  details  are  quite  different. 
Therefore,  the  SQL-based  association  rule  framework  can  be  used  for  sequential  patterns 
also.  The  algorithm  makes  multiple  passes  over  the  data-sequences  where  in  the  kth  pass 
it  finds  frequent  sequences  of  length  k.  Each  pass  consists  of  two  phases:  the  candidate 
generation  phase  and  the  support  counting  phase.  In  the  candidate  generation  phase,  the 
frequent  (k  -  l)-length  sequences,  Fk_i,  found  in  the  previous  pass  is  used  as  the  seed  set 
to  generate  candidate  fc-length  sequences  (Ck)  that  are  potentially  frequent.  The  candidate 
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generation  procedure  ensures  that  Ck  is  a  superset  of  Fk-  The  counting  phase  makes  a 
pass  over  the  input  data-sequences  to  find  the  support  for  each  candidate  sequence.  In  the 
support  counting  phase  for  each  candidate  sequence  s  G  Ck,  the  number  of  distinct  data- 
sequences  that  contains  s  is  counted.  Two  techniques  to  perform  this  operation  efficiently 
are  explained  in  Srikant  and  Agrawal  [120]:  one  which  uses  a  hash-tree  data  structure  and 
another  based  on  transforming  the  data-sequences  into  an  alternate  representation.  The 
algorithm  terminates  when  the  set  of  candidate  sequences  Ck  or  the  frequent  sequences  Fk 
becomes  empty. 

6.3     Candidate  Generation 

In  each  pass  k,  the  candidate  ^-sequences  Ck  are  generated  from  the  frequent  (k  —  1) 
sequences  Fk-\.  The  schema  of  Ck  is  the  same  as  that  of  frequent  sequences  explained 
above  in  Section  6.1,  except  that  we  do  not  require  the  len  attribute  since  all  the  tuples  in 
Ck  have  the  same  length  k. 

Candidates  are  generated  in  two  steps.  The  join  step  generates  a  superset  of  Ck  by 
joining  Fk-i  with  itself.  A  sequence  s\  joins  with  s2  if  the  subsequence  obtained  by  dropping 
the  first  item  of  si  is  the  same  as  the  one  obtained  by  dropping  the  last  item  of  s2-  This 
can  be  expressed  in  SQL  as  follows. 

insert  into  Ck  select  I\.item\,  Ii.enoi,  ...,  h.itemk-\,  I\.enok-i,  h-itemk-\, 

h.enok-i  +  I2.enok-i  -  h.enok-2 
from  Fk-\  /x.Ffc-i  h 

where         I\.itemi  =  h-itemi  and 

I\.iterrik-\  =  I2.ite,mk-<2.  and 
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I\.enoz  —  I\.eno2  =  l2-eno2  —  h-enoi  and 


Ii.enok-i  -  h.enok-2  =  h-enok-2  ~  h-enok-3 


In  the  above  query,  subsequence  matching  is  expressed  as  join  predicates  on  the  attributes 
of  Ffc_i.  Note  the  special  join  predicates  on  the  eno  fields  that  ensure  that  not  only  do 
the  joined  sequences  contain  the  same  set  of  items  but  that  these  items  are  grouped  in  the 
same  manner  into  elements.  The  result  of  the  join  is  the  sequence  obtained  by  extending 
s\  with  the  last  item  of  s2-  The  added  item  becomes  a  separate  element  if  it  was  a  separate 
element  in  s2,  and  part  of  the  last  element  of  si  otherwise.  In  our  representation  of  the 
candidate  sequence,  enok-i  and  enok_2  determine  if  the  added  item  is  a  separate  element. 
For  example,  let  F3  be  {((1,2)  (3)),  ((1,2)  (4)),  ((1)  (3,4)),  ((1,3)  (5)),  ((2)  (3,4)), 
((2)  (3)  (5))}.  In  the  join  step,  the  sequence  ((1,2)  (3))  joins  with  ((2)  (3,4))  to  generate 
((1,2)  (3,4))  and  with  ((2)  (3)  (5))  to  generate  ((1,2)  (3)  (5)).  There  are  no  other  join 
compatible  sequences  in  F3. 

In  the  prune  step,  all  candidate  sequences  that  have  a  non-frequent  contiguous  (k-1)- 
subsequence  are  deleted.  In  the  above  example,  the  sequence  ((1, 2)  (3)  (5))  will  be  deleted 
since  its  contiguous  subsequence  ((1)  (3)  (5))  is  not  frequent. 

We  perform  both  the  join  and  prune  steps  in  the  same  SQL  statement  by  writing 
the  above  query  as  a  fc-way  join  as  shown  in  Figure  6.1.  For  any  Ar-sequence  there  are  at 
most  k  contiguous  subsequences  of  length  (k  -  1)  for  which  Fk_x  needs  to  be  checked  for 
membership.  Note  that  all  (k-  1) -subsequences  may  not  be  contiguous  because  of  the  max- 
gap  constraint  between  consecutive  elements.  The  join  predicates  on  h  and  I2  remain  the 
same.   The  join  of  h  and  I2  generates  a  A;-sequence  (Ii.ttemu. . . ,  7i.itemfc_1,/2.«femfc_1) 
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where  two  of  its  (k  —  l)-subsequences  are  known  to  be  frequent;  it  is  generated  from  two  such 
(k  —  l)-subsequences.  Membership  checks  for  the  other  (k  —  2)  subsequences  are  performed 
using  additional  joins.  The  join  predicates  for  these  joins  are  enumerated  by  dropping  one 
item  at  a  time  from  the  generated  ^-sequence.  We  first  drop  item-i  and  if  it  results  in  a 
contiguous  subsequence,  we  check  for  its  membership  in  F^-x  by  the  join  with  I3.  For  the 
join  with  Ir,  we  drop  itemT-\. 

The  OR  clause  in  the  join  predicate  is  to  avoid  checking  for  non-contiguous  subse- 
quences that  are  formed  when  enor+i  —  enor_i  =  2  When  there  is  no  max-gap  constraint, 
the  join  predicates  will  not  contain  the  OR  part. 

While  joining  Fi  with  itself  to  get  C2,  we  need  to  generate  sequences  where  both  the 
items  appear  as  a  single  element  as  well  as  two  separate  elements.  Also  note  that  the  prune 
step  will  not  delete  any  candidate  sequences.  The  generation  of  C2  is  expressed  in  SQL  as 

insert  into  C2  (select  Ii.itemi,  1,  I2.itemi,  2 

from  Fi  I\,Fi  I2)  union 

(select  I\.itemi,  1,  I2.itemi,  1 

from  Fi  I\,F\  I2 

where         Ii.itemi  <  l2.itemi) 

For  instance,  if  Fj  contains  {(1),  (2)},  C2  will  have  {((1)  (1)),  ((1)  (2)),  ((2)  (1)),  ((2)  (2)), 
((1,2))}. 

6.4     Support  Counting  to  Find  Frequent  Sequences 

In  each  pass  k,  we  use  the  candidate  table  Ck  and  the  input  data-sequences  table  D 
to  count  the  support.  We  consider  SQL  implementations  based  on  SQL-92  and  SQL-OR 
in  Sections  6.5  and  6.6  respectively. 
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(Drop  item2) 
II. item  1 
Il.item3 

I2.item_k- 1 
Il.eno3  -  Il.enol 
Il.eno4  -  Il.eno3 


(Drop  item_k-l) 
II. item  1 

Il.item_k-2 
I2.item_k-1 
Il.eno2  -  Il.enol 


=  Ik.iteml  AND 

=  Ik.item_k-2  AND 
=  Ik.item_k-1  AND 
=  Ik.eno2  -  Ik.enol  AND 


Il.eno  k-2  ; 


1 1  .eno_k- 1  +  I2.eno_k- 1  -  I2.eno_k-2 

Ik.eno_k- 1  -  Ik.eno_k-2  OR 

1 1  .eno_k- 1  +  I2.eno_k- 1  -  I2.eno_k-2  -  1 1  .eno  k-2  =  2 


=  B.iteml   AND 
=  I3.item2  AND 

=  I3.item_k-l   AND 

=  I3.eno2-I3.enol  AND 

=  I3.eno3  -  I3.eno2  AND 


X 


F_k-1    Ik 


XI 


I2.eno_k-l  -  I2.eno_k-2  =  I3.eno_k- 
Il.eno3-  Il.enol  =  2 

=  I2.itemI   AND 


1  -  I3.eno_k-2  OR 


Il.item2 

Il.item_k-1  =  I2.item_k-2  AND 

1 1  ,eno3  - 1 1  .eno2  =  I2.eno2  -  I2.eno  1  AND 

Il.eno_k-l  -  Il.eno_k-2  =  J2.eno_k-2  -  I2.eno_k-3 


X 


FJc-1    13 


F_k-1    II 


F  k-1     12 


Figure  6.1.  Candidate  generation  for  any  k 


I  Litem  I 
Il.item3 
I2.item3 
Il.eno3  - 
I2.eno3  - 
Il.eno3  - 


(Drop  item3) 
1 1  .item  1  =  I4.item  1  AND 

1 1  .item2  =  I4.item2  AND 

I2.item3  =  I4.item3  AND 

Il.eno2  -  Il.enol  =  I4.eno2  -  I4.enol  AND 
Il.eno3  +  I2.eno3  -  I2.eno2  -  Il.eno2  = 
I4.eno3  -  I4.eno2  OR 
Il.eno3  +  I2.eno3  -  I2.eno2  -  Il.eno2  =  2 

(Drop  item2) 

=  13. item  1  AND 

=  I3.item2  AND 

=  I3.item3  AND 
Il.enol  =  I3.eno2  -  I3.enol  AND 
I2.eno2  =  I3.eno3  -  I3.eno2  OR 
Il.enol  =2 


X 


XI 


Il.item2  =I2.itemlAND 

Il.item3  =  I2.item2  AND 

Il.eno3  -  Il.eno2  =  I2.eno2  -  I2.enol 


X 


F3     II  F3     12 

Figure  6.2.  Candidate  generation  for  A;  =  4 
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6.5     Support  Counting  Using  SQL-92 
6.5.1     K-wav  Join 

In  the  kth  pass,  we  join  the  candidate  table  Ck  with  k  copies  of  the  data-sequence 
table  D  and  group  the  join  result  on  the  candidate  sequences  as  shown  in  Figure  6.3.  This 
approach  is  very  similar  to  the  KwayJoin  approach  for  association  rules  (refer  Section  3.5.1) 
except  for  the  following  two  key  differences. 

1.  We  use  select  distinct  to  ensure  that  only  distinct  data-sequences  are  counted. 

2.  Second,  we  have  additional  predicates  between  sequence  numbers  that  are  denoted 
as  PRED(A;)  in  the  query  for  brevity.  The  predicates  PRED(A;)  is  a  conjunct  (and)  of 
terms  Pij{k)  corresponding  to  each  pair  of  items  from  Ck.  Pij(k)  is  expressed  as 

(Ck.enoj  —  Ck-enoj,  and  &bs(dj.time  -  di.time)  <  window-size)  or 
{Ck-enoj  =  Ck-enoi  +  1  and  dj.time  -  di.time  <  max-gap  and 

dj.time  —  di.time  >  min-gap)  or 

{Ck-enoj  >  Ck-enoi  +  1) 

Intuitively,  these  predicates  check  (a)  if  two  items  of  a  candidate  sequence  belong 
to  the  same  element,  then  the  difference  of  their  corresponding  transaction  times  is 
at  most  window-size  and  (b)  if  two  items  belong  to  adjacent  elements,  then  their 
transaction  times  are  at  most  max-gap  and  at  least  min-gap  apart. 

We  compute  the  frequent  1-sequences  by  grouping  the  data-sequences  table  on  the 
item  attribute,  counting  the  number  of  distinct  sequences  in  which  the  item  is  present  and 
filtering  the  non-frequent  items.  The  SQL  query  for  this  computation  is  given  below. 
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insert  into  F\  (select  item,  1,  count(*) 
from  (select  distinct  sid,  item 

from     D)  as  t 
group  by  item 
having  count(*)  >  :minsup 

6.5.2     Subquery  Optimization 

The  subquery  optimization  for  association  rules  can  be  applied  for  sequential  patterns 
also  by  splitting  the  support  counting  query  in  pass  k  into  a  cascade  of  k  subqueries.  The 
predicates  ptJ  can  be  applied  either  on  the  output  of  subquery  Qk  or  sprinkled  across 
the  different  subqueries  (shown  as  SubQ-PRED(Z)  in  Figure  6.4).  The  predicates  SubQ- 
PRED(/)  is  a  conjunct  of  terms  pij(l)  corresponding  to  each  pair  of  items  from  dt,  the 
distinct  Z-item  prefixes  of  Ck-  Pij(l)  is  expressed  as 

(di.enoj  =  di.enoi  and  &bs(timej  -  timei)  <  window-size)  or 
(di.enoj  =  di.enoi  +  1  and  timej  —  timei  <  max-gap  and 

timej  —  timei  >  min-gap)  or 

(di.enoj  >  di.enoi  +  1) 

6.6     Support  Counting  Using  SQL-OR 
6.6.1     Vertical 

This  approach  is  similar  to  the  Vertical  approach  for  association  rules  (refer  Sec- 
tion 4.2),  where  we  convert  the  transaction  table  into  a  vertical  format.  For  each  item,  we 
create  a  BLOB  (s-list)  containing  all  (sid,  time)  pairs  corresponding  to  that  item.  We  use 
a  table  function  Gather  for  creating  the  s-lists.    The  sequence  table  D  is  scanned  in  the 
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insert  into  Fk 

select  itemi,  eno\, . . .  ,item.k,  enok,  count (*) 

from    (select  distinct  di.sid,  item\,  enoi, . . .  ,itemk,  eno^ 

from     Ck,  D  di,. . .  ,D  <4 

where  di.item  =  Ck-itemi  and 

di.item  =  Ck-iterrik  and 
di.sid  =  ^2-sid  and 

di.sid  =  dfc.sid  and 

PRED(&) 
)  as  t 
group  by  item\,  enoi,. . .  ,iterrik,  enof. 
having  count(*)  >  :minsup 


having 
count(*)  >  rminsup 

t 
Group  by 

iteml,...,itemk, 

enol,...,enok 


PRED(k) 
Ck.iteml  =  d  Litem 

Ck.itemk  =  dk.item 


dl  .sid  =  dk.sid  [^><C1 


Ck 


dl.sid  =  d2.sid    j^XCj1 


D    dk 


D    dl 


D    d2 


Figure  6.3.  Support  counting  by  K-way  join 
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insert  into  F^ 

select  itemi,enoi,. . .  ,itemk:enoh,  count(*) 

from  (select  distinct  sid,itemi,enoi, . . . , 

itemk,enok  from  (Subquery  Q^)  ) 
group  by  item\,eno\, . . .  ,item,k,enok 
having  count(*)  >  :minsup 

Subquery  Qi  (for  any  /  between  1  and  k): 

select  itemi,enoi,time\i . .  .itemi,enoi,ti.time,  sid 
from  D  ti,  (Subquery  Qi-i)  as  rj_i, 
(select  distinct  item\,eno\, . . . 

itemi,enoi  from  Ck)  as  d\ 
where  ri_i.item\  =  di.itemi  and  . . .  and 

r[_x.item[_i  =  d/.i£em/_1and 

ri^i.sid  =  ti.sid  and 

ti.item  —  di.itemi  and  SubQ-PRED(Z) 
Subquery  Q0:  No  subquery  Q0. 


Subquery  Q_I 

t 

item_l,  eno_l,  time_l,...,item_],  enoj,  tl.time,  sid 


f 


iv.  '  ^s\    r  1-1. sid  =  tl.sid 
SubQ-PREDa)p><qa.item  =  di.itemi 

r_l-l.iteml      =.dl.iteml  ^^  ^s. 

rj-l.itemj-l  =dl.item_l-l     P><^  D  tl 

Subquery  QJ-1  select  distinct 

item_l,eno  l,...,item_l,  eno_l 

f 
Ck 

Figure  6.4.  Subquery  optimization  for  KwayJoin  approach 
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(item,  sid,  time)  order  and  passed  to  the  function  Gather,  which  collects  the  (sid,  time) 
attribute  values  of  all  tuples  of  D  with  the  same  item  in  memory.  For  items  that  meet 
the  minimum  support  criterion,  the  function  outputs  a  (item,  s-list)  pair.  The  s-lists  are 
maintained  sorted  using  sid  as  the  major  key  and  time  as  the  minor  key,  and  is  stored  in  a 
new  SlistTable  with  the  schema  (item,  s-list). 

In  the  support  counting  phase,  we  collect  the  s-lists  of  all  the  items  of  a  candidate 
and  intersect  them  to  determine  the  data-sequences  containing  that  sequence.  For  each 
candidate  sequence  in  Ck,  we  select  from  the  SlistTable  the  s-lists  corresponding  to  the 
k  items  in  the  sequence  and  pass  them  to  a  UDF  Countlntersect-K  along  with  the  eno 
attributes  of  the  candidate,  as  shown  in  Figure  6.5.  The  UDF  intersects  the  s-lists  and  does 
the  support  counting.  Eventhough  this  approach  is  the  same  as  the  one  for  associations, 
the  Countlntersect-K  function  is  quite  different. 

The  UDF  first  finds  the  data-sequence  containing  all  the  items  and  marks  the  bound- 
aries of  it  in  each  of  the  s-lists.  Note  that  the  s-lists  are  maintained  sorted  on  (sid,  time). 
The  UDF  also  groups  the  s-lists  according  to  the  elements  of  the  candidate  sequence  us- 
ing the  values  of  the  eno  parameters.  For  determining  whether  the  candidate  is  contained 
in  the  data-sequence,  we  use  an  algorithm  similar  to  the  one  described  in  Srikant  and 
Agrawal  [120].  The  details  of  this  procedure  are  different  due  to  the  vertical  representation 
of  data-sequences.  If  the  candidate  is  contained  in  the  data-sequence  the  support  count  is 
incremented  and  the  same  steps  are  repeated  for  the  subsequent  data-sequences. 

The  intersect  operation  can  be  decomposed  to  share  it  across  sequences  having  common 
prefixes  similar  to  the  subquery  optimization  in  the  KwayJoin  approach.  In  this  approach 
we  need  to  use  two  UDFs,  Intersect  and  Count.  The  former  intersects  the  s-lists  to  filter 
out  the  data-sequences  that  does  not  contain  all  the  items  of  the  candidate.   A  new  s-list 
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insert  into  Fk 

select  t.itemi,  t.eno\, . . .  ,t.itemk,  t.  enok,  cnt 

from  (select  item\,  eno\,. . .  ,itemki  enok, 

CountIntersect-K(Cfc.enoi, . . . ,  Ck.enok,  c^  .s-list,. . .  ,dfc.s-list, 
window-size, min-gap,max-gap)  as  cnt 

from     Ck,  SlistTable  d\,...  ,SlistTable  dk 

where  di.item  =  Ck.item,i  and 

dfc.item  =  Ck-itemk)  as  t 
where         cnt  >  :minsup 

iteml,...,itemk,  enol,...,enok,  cnt 

t 

cnt  >  rminsup 

A 
cnt 

Countlntersect-K  (UDF) 


dl.s-list,...,dk.s-list 


Ck.  item  1  —  d  Litem     p^^^^ 
Ck.itemk  =  dk.item     '^r" 


SlistTable  dl,...,  SlistTable  dk 

Figure  6.5.  Support  counting  by  Vertical 

corresponding  to  the  intersection  is  created  and  is  passed  to  Count  which  counts  the  number 
of  data-sequences  containing  the  candidate.  In  the  later  passes  of  the  GSP  algorithm,  this 
approach  might  be  better  since  several  candidates  may  share  common  prefixes  and  also 
the  length  of  the  intersected  s-list  is  likely  to  be  small.  However,  in  the  approach  using 
Countlntersect-K,  the  intersected  s-list  is  not  materialized  since  the  UDF  interleaves  the 
intersection  and  counting.  Therefore,  it  is  suitable  for  the  earlier  passes. 
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6.6.2     GatherJoin 

The  GatherJoin  approach  for  association  rules  (refer  Section  4.1)  can  be  extended 
to  mine  sequential  patterns  also.  In  this  approach,  we  first  Gather  all  the  (tid,  item)  pairs 
corresponding  to  fixed  values  of  sid.  We  then  generate  all  possible  A;-sequences  using  a 
table  function,  join  them  with  Ck  and  group  the  join  result  to  count  the  support  of  the 
candidate  sequences.  The  time  constraints  are  checked  on  the  table  function  output  using 
join  predicates  PRED(fc)  as  in  the  KwayJoin  approach.  The  number  of  tuples  generated 
by  the  table  function  can  be  reduced  by  applying  the  max-gap,  min-gap  and  window-size 
constraints  inside  the  table  function.  However,  PRED(Ar)  is  still  required  to  check  if  the 
sequences  in  Ck  are  contained  in  the  generated  fe-sequences.  The  items  in  a  candidate 
sequence  are  not  lexicographically  ordered.  Therefore,  a  data-sequence  containing  n  items 
can  potentially  support  nk  A>sequences.  For  real-life  datasets,  the  value  of  n  is  large  and 
hence  this  approach  will  generate  too  many  sequences,  thereby  making  it  less  attractive 
in  terms  of  performance.  However,  an  approach  similar  to  Gather/Count  will  be  feasible 
where  we  keep  the  candidate  sequences  inside  the  table  function  and  generate  only  sequences 
present  in  the  candidate  set.  The  GatherJoin  approach  was  not  implemented  for  sequential 
patterns. 

6.7     Taxonomies 

The  approaches  to  handle  taxonomies  for  association  rules  (refer  Chapter  5)  are  ap- 
plicable for  sequential  patterns  also.  The  basic  idea  is  to  replace  each  data-sequence  d  with 
an  "extended-sequence"  d'.  Each  transaction  of  d  is  extended  with  all  the  ancestors  of  its 
items  to  get  d'.  In  order  to  optimize  the  performance,  we  pre-compute  the  ancestors  and 
prune  candidates  with  an  element  that  contains  both  an  item  and  its  ancestor. 
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6.8     Summary 

In  this  chapter,  we  presented  various  SQL  formulations  for  mining  sequential  patterns. 
We  extended  a  few  representative  association  rule  mining  approaches  in  order  to  show  that 
sequential  pattern  mining  can  also  be  handled  with  in  the  SQL  framework.  The  major 
changes  were  in  the  join  predicates  for  candidate  generation  and  support  counting. 

We  did  not  do  extensive  experimentation  for  the  sequential  pattern  mining  approaches. 
Among  the  SQL-based  approaches,  the  Vertical  approach  performed  the  best  in  our  exper- 
iments. The  SQL  queries  for  the  higher  numbered  passes  becomes  too  long  since  there  are 
several  join  predicates  and  as  a  result  the  optimizer  does  not  generate  optimal  execution 
plans.  Further,  in  the  Vertical  approach,  the  user-defined  functions  for  support  counting  are 
much  more  complex  than  their  association  rule  counterparts.  We  also  did  not  implement 
the  extensions  to  handle  taxonomies  in  this  case. 


CHAPTER  7 
INCREMENTAL  MINING 


Data  mining  discovers  information  within  data  warehouses  and  finds  answers  to  ques- 
tions about  your  data  that  you  haven't  thought  to  ask.  The  rules  discovered  from  a  database 
only  reflect  the  current  state  of  the  database.  In  order  to  make  the  discovered  rules  reliable 
and  useful,  large  volumes  of  data  need  to  be  collected  and  analyzed  over  a  period  of  time. 
This  entails  the  development  of  techniques  to  handle  large  volumes  of  data,  and  to  main- 
tain rules  over  a  significantly  long  period  of  time.  Therefore,  efficient  algorithms  to  update, 
maintain  and  manage  the  discovered  rules  are  central  to  the  database  mining  technology. 

Association  rule  mining  is  an  important  data  mining  problem  which  finds  applications 
in  various  business  domains.  Association  rules  are  discovered  from  a  transaction  database 
and  updates  to  it  could  potentially  invalidate  existing  rules  or  introduce  new  rules.  The 
problem  of  updating  the  rules  can  be  reduced  to  finding  the  new  set  of  frequent  itemsets. 
Since  rule  generation  is  computationally  inexpensive,  it  is  not  critical  to  develop  incremental 
rule  generation  algorithms.  A  simple  solution  to  the  update  problem  is  to  re-compute  the 
frequent  itemsets  from  the  updated  database.  This  is  clearly  inefficient  because  all  the 
computations  done  initially  for  finding  the  old  frequent  itemsets  are  wasted.  An  algorithm, 
FUP  (Fast  Update),  for  updating  the  frequent  itemsets  has  been  developed  for  the  addition 
of  new  transactions  to  the  database  [36].   It  is  based  on  the  Apriori  algorithm  and  needs 
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0(n)  passes  over  the  database  where  n  is  the  size  of  the  maximal  frequent  itemset.  Another 
incremental  algorithm  is  presented  in  Feldman  et  al.  [47]. 

In  this  chapter,  we  present  an  algorithm  to  find  the  new  frequent  itemsets  with  min- 
imal re-computation  when  new  transactions  are  added  to  or  deleted  from  the  transaction 
database  [123].  Deletion  is  important  in  cases  where  we  want  to  analyze  the  data  in  a 
sliding  time  window.  The  important  characteristics  of  our  algorithm  are  the  following. 

•  Along  with  the  frequent  itemsets,  we  also  maintain  the  negative  border  [127]1  .  The 
algorithm  uses  negative  borders  to  decide  when  to  scan  the  whole  database  and  it 
can  be  used  in  conjunction  with  any  level-wise  algorithm  like  Apriori  [13]  or  Parti- 
tion [111]. 

•  We  first  compute  the  frequent  itemsets  of  the  increment  database.  The  algorithm 
requires  a  full  scan  of  the  whole  database  only  if  the  negative  border  of  the  frequent 
itemsets  expands,  that  is,  if  an  itemset  outside  the  negative  border  gets  added  to  the 
frequent  itemsets  or  its  negative  border.  Even  in  such  cases,  it  requires  only  one  I/O 
pass  over  the  whole  data  set. 

Constrained  association  mining  is  another  useful  technique  for  goal-oriented  mining. 
We  generalize  the  incremental  mining  algorithm  to  handle  cases  with  contstraints  on  the 
items  and  certain  kinds  of  constraint  relaxation. 

In  Section  7.1,  we  present  the  incremental  algorithm  to  update  the  frequent  itemsets. 
Experimental  results  quantifying  the  performance  advantages  of  the  incremental  algorithm 
are  presented  in  Section  7.2  and  in  Section  7.3,  we  compare  it  with  FUR  In  Section  7.4, 
we  explore  the  database  integration  alternatives  of  incremental  mining  and  in  Section  7.4.2 


Note  that  the  negative  border  can  be  maintained  while  computing  the  frequent  itemsets  without  any 
additional  computation  overhead. 
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we  present  the  performance  results  of  the  SQL-based  incremental  algorithm  [124].  In  Sec- 
tion 7.5,  we  introduce  constrained  associations  and  generalize  the  incremental  approach  to 
handle  constraints.  The  incremental  algorithm  is  applicable  to  a  larger  class  of  data  mining 
and  decision  support  problems.  Some  of  them  are  briefly  outlined  in  Section  7.6. 

7.1     Incremental  Updating  of  Frequent  Itemsets 

In  this  section,  we  develop  an  efficient  algorithm  for  updating  the  frequent  itemsets 
when  the  database  is  updated.  In  the  context  of  basket  data,  database  update  effectively 
means  addition  of  new  transactions  to  the  database  or  deletion  of  existing  transactions.  In 
Section  7.1.1,  we  describe  how  to  compute  the  negative  border  of  a  set  F  of  frequent  item- 
sets.  In  Section  7.1.2,  we  present  and  prove  some  lemmas  and  theorems  before  describing 
the  algorithm  to  handle  the  addition  of  transactions.  Handling  the  deletion  of  transactions 
is  outlined  in  Section  7.1.3. 

7.1.1     Computing  MBd(F)  from  F 

The  negative  border  (AfBd(F))  of  a  set  of  frequent  itemsets  F  can  be  computed  by 
repeating  the  join  and  prune  steps  of  the  apriori-gen  function  in  the  apriori  algorithm  [13]. 
This  computation  can  be  done  using  only  the  set  of  frequent  itemsets  F  and  the  database 
need  not  be  scanned. 

Definition  1  The  negative  border  AfBd(F),  of  a  collection  of  itemsets  F  is  defined  as  fol- 
lows: Given  a  collection  F  C  V{R)  of  sets,  closed  with  respect  to  the  set  inclusion  relation, 
the  negative  border  MBd(F)  of  F  consists  of  the  minimal  itemsets  X  C  R  not  in  F  [83]. 

The  apriori-gen  function  (described  in  Figure  7.1)  takes  as  argument  Fk-\,  the  set  of 
all  frequent  (k  -  l)-itemsets.  It  returns  the  set  of  A;-itemsets  that  are  potentially  frequent. 
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function  apriori-gen(F/t_i) 

for  each  p  and  q  £  Fk-\  do 

if  p. item i  =  q.itemi,. . .  ,p.itemk-2  =  q.item.k-2  and  p.item,k-\   <  q.itemk-i 
then  insert  p.item\ ,p.item,2,  ■  ■  ■  ,p.iterrik-i,q-itemk-i  into  Ck 

for  each  c  G  C* 

delete  c  from  Ck  if  some  (k  -  l)-subset  of  c  is  not  in  Fk-i 

Figure  7.1.  A  high-level  description  of  the  apriori-gen  function 

The  negative  border  consists  of  all  itemsets  that  were  candidates  which  did  not  have 
the  minimum  support  in  the  level-wise  method.  That  is,  NBd(Fk)  =  C*  —  i*1*  where  Cjt  is 
the  set  of  candidate  A;-itemsets,  Fk  is  the  set  of  frequent  A;-itemsets  and  MBd(Fk)  is  the  set 
of  A;-itemsets  in  MBd{F).  Therefore,  Fk\J  MBd{Fk)  =  Ck.  The  negativeborder-gen  function 
to  compute  MBd{F)  is  explained  in  Figure  7.2. 

function  negativeborder-gen(F) 

Split  F  into  F\ ,  F2, . . . ,  Fn  where  n  is  the  size  of  the  largest  itemset  in  F 
forall  k  =  1,2, ...,n  do 

compute  Ck+i  using  apriori-gen(Fk) 
FU  NBd(F)  =  \Ji=2      n+1  Ck  U/i  where  h  is  the  set  of  1-itemsets. 

Figure  7.2.  A  high-level  description  of  the  negativeborder-gen  function 
Lemma  1  All  1-itemsets  should  be  present  in  FL)  MBd(F). 
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Proof:    The  lemma  follows  directly  from  the  definition  of  negative  border  since  any  1- 
itemset  not  in  F  is  a  minimal  itemset  not  in  F.  □ 

7.1.2     Addition  of  New  Transactions 

When  new  transactions  are  added  to  the  database,  an  old  frequent  itemset  could 
potentially  become  infrequent  in  the  updated  database.  Similarly,  an  old  infrequent  itemset 
could  potentially  become  frequent  in  the  new  database. 

In  order  to  solve  the  update  problem  efficiently,  we  maintain  the  frequent  itemset 
and  the  negative  border  along  with  their  support  count  in  the  database.  That  is,  for 
every  s  <E  Fl)NBd(F),  we  maintain  s.count.  In  the  rest  of  this  section,  DB  denotes  the 
original  database,  db  denotes  the  transactions  that  are  newly  added  and  DB+  denotes  the 
updated  database.  Also  FDB,  Fdb  and  FDB+  denotes  the  frequent  itemset  and  NBd{FDB), 
AfBd(Fdb)  and  MBd{FDB+)  denotes  the  negative  border  of  the  original  database,  increment 
database  and  the  updated  database  respectively. 

Lemma  2  Let  s  be  any  itemset  such  that  s  0  FDB .   Then  s  G  FDB+  only  if  s  €  Fdb. 

Proof:  Assume  that  there  exists  an  itemset  s  such  that  s  G  FDB+,  s  0  FDB  and  s  g  Fdb. 
Let  tDB{s)  and  tdb{s)  be  the  number  of  transactions  in  DB  and  db  respectively  containing 
the  itemset  s.  Also  let  tDB  and  tdb  be  the  total  number  of  transactions  in  DB  and  db 
respectively.  Since  s  0  FDB  and  s  £  Fdb, 


tDB(s)  .    _  ,         ,  tdb(s) 

— <  mmbupport   and  — —    <  minSupport. 

tDB  tdb 


From  these  two  equations,  it  can  be  shown  that 
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tDB(s)  +  tdb{s)  . 

<   mmbupport 

tDB  +  tdb 


Therefore,  s  #  FDB+  which  is  a  contradiction.  D 

Lemma  3  Let  s  be  an  itemset  such  that  s  E  MBd(F).  Then  all  possible  subsets  of  s  must 
be  present  in  F. 

Proof:  For  a  contradiction,  let  t  be  an  itemset  such  that  t  C  s  and  t  0  F.  By  the  definition 
of  negative  border,  AfBd(F)  consists  of  the  minimal  itemsets  not  in  F.  Since  t  0  F,  s  is  not 
a  minimal  itemset  not  in  F.  Therefore  s  cannot  be  in  AfBd(F),  which  is  a  contradiction. 

□ 

Theorem  1  Let  s  be  an  itemset  such  that  s  &  FDB  UAfBd(FDB)  and  s  6  FDB+.  Then 
there  exists  an  itemset  t  such  that  t  C  s,  t  E  MBd{FDB)  and  t  E  FDB+.  That  is,  some 
subset  of  s  has  moved  from  MBd(FDB)  to  FDB+. 

Proof:  Since  s  e  FDB+,  all  possible  subsets  of  s  should  be  in  FDB+.  But  all  the  subsets 

of  s  cannot  be  in  FDB  because  if  that  was  the  case,  then  s  should  be  present  in  at  least 

MBd{FDB)  if  not  in  FDB  itself.  By  our  assumption,  s  £  FDB  U  MBd{FDB).  Therefore, 

there  exists  an  itemset  t  such  that  tCs  and  t  0  FDB.  Now  we  have  two  cases. 

Case  »  :  t  €  MBd{FDB). 

In  this  case,  t  E  FDB+  since  s  E  FDB+  and  t  C  s.  Therefore,  we  have  found  a  subset  of  s 

which  has  moved  from  MBd(FDB)  to  FDB+ . 

Case  ii  :  t$NBd{FDB). 

That  is,  t  £  FDBuMBd(FDB).  But,  we  know  that  t  E  FDB+  since  s  E  FDB+  and  t  C  a. 

Therefore,  t  &  FDB  U  tfBd(FDB)  and  t  E  FDB+  and  hence  we  can  apply  the  theorem 

recursively  on  t.  Note  that  the  size  of  t  is  less  than  the  size  of  s  since  t  C  s. 
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When  this  is  applied  recursively,  there  are  two  possibilities.  First  is,  for  some  subset  of 
t,  case  i  holds  true  in  which  case,  there  is  a  subset  of  t  which  has  moved  from  MBd{FDB)  to 
FDB+,  and  hence  the  theorem  is  proved.  Otherwise,  t  will  finally  become  a  1-itemset.  By 
Lemma  1,  we  know  that  all  1-itemsets  are  present  in  FDB  UAfBd(FDB).  Since  t  #  FDB, 
t  6  NBd{FDB)  which  contradicts  the  assumption  for  case  it.  That  is,  case  ii  is  not  possible 
if  t  is  a  1-itemset.  □ 

By  Theorem  1,  if  none  of  the  itemsets  move  from  the  negative  border  to  the  frequent 
itemset,  we  do  not  need  to  scan  the  whole  database.  Even  in  cases  where  some  itemsets 
move  from  the  negative  border  to  the  frequent  itemset,  a  complete  database  scan  is  required 
only  if  the  negative  border  expands  because,  for  all  the  itemsets  in  the  negative  border,  we 
can  derive  the  updated  support  count  without  a  database  scan. 

We  maintain  the  support  count  for  all  itemsets  in  the  frequent  itemset  and  the  negative 
border.  First,  we  compute  the  frequent  itemset  in  db  using  a  level-wise  algorithm  like  Apriori 
or  Partition.  Simultaneously  we  count  the  support  for  all  itemsets  in  FDB  UAfBd{FDB) 
in  db.  If  an  itemset  t  e  FDB  does  not  have  minimum  support  in  DB  U  db,  then  t  is 
removed  from  FDB .  This  can  be  easily  checked  since  we  know  the  support  count  for  t  in 
DB  and  db.  The  change  in  FDB  could  potentially  change  AfBd(FDB)  also.  Therefore,  we 
have  to  recompute  the  negative  border  using  the  negativeborder-gen  function  explained  in 
subsection  7.1.1. 

On  the  other  hand  there  could  be  some  new  itemsets  which  become  frequent  in  the 
updated  database.  Let  s  be  an  itemset  which  gets  added  to  the  frequent  itemset  of  the 
updated  database.  By  Lemma  2,  we  know  that  s  has  to  be  in  Fdb.  We  also  know  by 
Theorem  1  that  some  subset  of  s  must  move  from  MBd(FDB)  to  FDB+ .  For  each  itemset 
s  £  Fdb,  we  check  if  s  gets  the  minimum  support  to  move  from  MBd{FDB)  to  FDB+.  If 
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/imc£JonUpdate-Frequent-Itemset(FDB,  MBd(FDB),  db) 

//DB  and  db  denote  the  number  of  transactions  in  the  original  database  and  the 
increment  database  respectively. 

Compute  Fdb 

for  each  itemset  s  €  FDB  UAfBd(FDB)  do 

tdb(s)  =  number  of  transactions  in  db  containing  s 
FDB+  =  4> 
for  each  itemset  s  £  FDB  do 

if  {tDB(s)  +  tdb{s))  >  minsup  *  (DB  +  db)  then  FDB+  =  FDB+  U  s 

for  each  itemset  s  €  Fdb  do 

ifsg  FDB  and  s  e  MBd(FDB)  and  (tDB{s)  +  tdb(s))  >  minsup*  (DB  +  db) 
then  FDB+  =  FDB+  U  s 
ifFDB^FDB+  then 

AfBd(FDB+)  =  negativeborder-gen(Fr,B+) 
else  NBd(FDB+)  =  NBd(FDB) 
ifFDBuMBd(FDB)  ?  FDB+  \jtfBd(FDB+)  then 

S  =  FDB+ 

repeat 

compute  5  =  S  U  MBd(S) 

until  S  does  not  grow 

FDB+  =  {iE  S\support(x)  >  minsup} 

//support(x)  is  the  support  count  of  x  in  DB  U  db 

MBd(FDB+)  =  negativeborder-gen(FZ5B+) 


Figure  7.3.  A  high-level  description  of  the  Update-Frequent-Itemset  function 
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none  of  the  itemsets  in  MBd(FDB)  gets  the  minimum  support,  no  new  itemsets  will  be 
added  to  FDB+.  If  some  itemsets  in  MBd{FDB)  gets  the  minimum  support  move  them  to 
FDB+  and  recompute  the  negative  border.  If  FDB+ uMBd{FDB+)  ^  FDB  uXBd{FDB), 
we  have  to  find  the  negative  border  closure  of  FDB+  and  scan  the  entire  database(.DI?) 
once  to  find  the  updated  frequent  itemset  and  negative  border.  The  negative  border  closure 
of  F  is  found  by  repeatedly  finding  F  =  F  uNBd(F)  until  F  does  not  grow. 

During  the  database  scan,  all  the  itemsets  which  are  in  the  negative  border  closure 
that  were  not  originally  in  FL)AfBd(F)  are  used  as  the  candidate  itemsets  and  their  support 
count  is  computed.  The  candidate  set  can  further  be  pruned  by  applying  an  optimization 
while  finding  the  negative  border  closure.  It  can  be  observed  that  an  itemset  which  is  not 
frequent  in  the  increment  database  (db)  cannot  get  added  to  the  updated  set  of  frequent 
itemsets.  Therefore,  such  itemsets  can  be  pruned  at  each  step  of  the  negative  border  closure 
computation  to  get  the  pruned  negative  border  closure.  However,  the  support  count  of  these 
pruned  itemsets  should  also  be  found  since  they  may  potentially  be  in  the  updated  negative 
border. 

7.1.3     Deletion  of  Existing  Transactions 

Similar  to  the  case  where  new  transactions  are  added  to  the  database,  the  frequent 
itemset  and  its  negative  border  could  potentially  change  when  some  existing  transactions 
are  deleted  from  the  database.  As  in  the  former  case,  we  maintain  the  frequent  itemset  and 
the  negative  border  along  with  their  support  count  in  the  database.  Let  DB—  denote  the 
updated  database  and  FDB~  and  AfBd(FDB~)  denote  its  frequent  itemset  and  negative 
border  respectively. 
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Lemma  4  Let  s  be  an  itemset  such  that  s  G  FDB .    Then  s  &  FDB~  only  if  s  e  Fdb.    That 
is  a  frequent  itemset  s  will  become  infrequent  only  if  s  6  Fdb . 

This  lemma  can  be  proved  in  the  same  way  as  lemma  2. 

The  algorithm  to  compute  the  frequent  itemset  and  the  negative  border  of  DB—  is 
similar  to  the  one  in  the  case  where  new  transactions  are  added  to  the  database  and  can 
be  derived  easily. 

7.2     Experimental  Results 

We  conducted  a  set  of  experiments  to  compare  the  performance  of  our  incremental 
algorithm.  These  experiments  were  performed  on  a  Sun  SPARCstation  4  running  SunOS 
5.5.  In  this  section,  we  report  on  the  results  of  some  of  those  experiments. 

The  experiments  were  performed  on  synthetic  data  generated  using  the  same  tech- 
nique as  in  Agrawal  and  Srikant  [13].  The  dataset  used  for  the  baseline  experiment  was 
T10.I4.D100K  (Mean  size  of  a  transaction  =  10,  Mean  size  of  maximal  potentially  large 
itemsets  =  4,  Number  of  transactions  =  100  thousand).  The  increment  database  is  created 
as  follows:  We  generate  100  thousand  transactions,  of  which  (100  —  d)  thousand  is  used  for 
the  initial  computation  and  d  thousand  is  used  as  the  increment,  where  d  is  the  fractional 
size  (in  percentage)  of  the  increment. 

We  compare  the  execution  time  of  the  incremental  algorithm  with  respect  to  running 
Apriori  on  the  whole  data  set.  Figure  7.4  shows  the  speed  up  of  the  incremental  algorithm 
over  Apriori  for  different  minimum  support  thresholds.  We  report  the  results  for  increment 
sizes  of  1%,2%,5%  and  10%  (shown  in  the  legend).  From  the  graph,  it  can  be  seen  that 
the  incremental  algorithm  achieves  speed  up  of  about  3  to  20.  The  algorithm  shows  better 
speed  up  for  medium  support  threshold  than  low  and  high  support  thresholds.    At  high 
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Figure  7.4.  Speed-up  of  the  incremental  algorithm 

support  thresholds,  the  number  of  frequent  itemsets  and  the  number  of  passes  in  the  level- 
wise  algorithm  are  less  and  hence  it  is  less  costly  to  run  Apriori  on  the  whole  database.  At 
low  support  thresholds,  the  probability  of  the  negative  border  expanding  is  higher  and  as  a 
result  the  incremental  algorithm  may  have  to  scan  the  whole  database.  Also,  the  speed  up 
is  higher  for  smaller  increment  sizes  since  the  incremental  algorithm  needs  to  process  less 
data. 

7.3     Comparison  with  FUP 

The  framework  of  FUP  [36]  is  similar  to  that  of  Apriori  and  contains  a  number  of 
iterations.  Each  iteration  is  associated  with  a  complete  scan  of  the  whole  database  and 
in  iteration  k  all  the  frequent  fc-itemsets  are  found.  The  candidate  sets  for  iteration  k  +  l 
are  generated  based  on  the  frequent  itemsets  found  in  iteration  k.  FUP  uses  the  following 
steps  to  compute  the  frequent  itemsets. 

1.  At  each  iteration  k,  the  support  of  the  frequent  A;-itemsets  (Lk)  is  updated  against 
the  increment  database  db  to  filter  out  the  itemsets  that  are  no  longer  frequent  in  the 
updated  database. 
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2.  The  set  of  candidate  sets  C/t  is  generated  by  applying  the  apriori-gen  function  on 
Lk_^~  (frequent  (k-l)-itemsets  in  the  updated  database).  The  itemsets  in  Lk  are 
pruned  from  Cfc  because  they  have  already  been  handled  in  the  above  step.  C*  is 
further  pruned  by  removing  the  itemsets  that  do  not  have  minimum  support  in  db 
since  they  cannot  be  frequent  in  the  updated  database. 

3.  Scan  the  whole  database  to  count  the  support  of  all  the  itemsets  in  Ck  and  the  ones 
which  have  the  minimum  support  are  identified  as  the  new  frequent  A;-itemsets. 

The  speed  up  of  FUP  over  Apriori  can  be  mainly  attributed  to  the  reduction  in  the 
number  of  candidate  itemsets  thereby  reducing  the  computation.  However  there  is  no 
reduction  in  the  I/O  requirements  because,  like  apriori  FUP  may  also  require  0{n)  passes 
over  the  database  where  n  is  the  size  of  the  maximal  frequent  itemset. 

The  steps  involved  in  our  incremental  algorithm  can  be  summarized  as  follows. 

1.  Compute  the  frequent  itemsets  of  the  increment  database  (Ldb)  using  any  level- wise 
algorithm  like  Apriori  or  Partition.  Simultaneously,  count  the  support  in  db  of  all 
the  frequent  itemsets  and  the  negative  border. 

2.  Using  the  above  support  counts  and  the  negativeborder-gen  function,  determine  if 
the  negative  border  has  expanded. 

3.  In  case  of  a  negative  border  expansion,  find  the  negative  border  closure  and  scan  the 
whole  database  once  to  count  the  support  of  the  newly  added  itemsets.  The  ones 
with  minimum  support  are  added  to  the  set  of  frequent  itemsets. 

The  notable  feature  here  is  that  the  whole  database  is  scanned  only  if  required  (and 
that  too  only  once),  thereby  reducing  the  I/O  requirements  drastically.    Computing  the 
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negative  border  closure  may  increase  the  size  of  the  candidate  set.  However,  a  majority  of 
those  itemsets  would  have  been  present  in  the  original  negative  border  or  frequent  itemset. 
Only  those  itemsets  which  were  not  covered  by  the  negative  border  need  to  be  checked 
against  the  whole  database.  As  a  result,  the  size  of  the  candidate  set  in  the  final  scan  could 
potentially  be  much  smaller  as  compared  to  FUR 

7.4     Database  Integration  of  Incremental  Mining 

In  this  section,  we  discuss  the  various  database  integration  architectures  for  incremen- 
tal mining  based  on  the  alternatives  presented  in  Chapter  2.  In  Section  7.4.1,  we  present 
two  SQL-based  approaches  for  incremental  frequent  itemset  computation.  The  applicability 
of  the  other  architectural  alternatives  are  discussed  in  Section  7.4.4. 

7.4.1     SQL  Formulations  of  Incremental  Mining 

Two  categories  of  SQL  formulations  for  frequent  itemset  mining  based  on  SQL-92  and 
SQL-OR  are  presented  in  Chapters  3  and  4  respectively.  A  cost-based  analysis  of  the  SQL- 
92  approaches  and  the  related  performance  optimizations  are  discussed  in  Thomas  and 
Chakravarthy  [125].  We  discuss  here  how  these  techniques  can  be  adapted  for  incremental 
mining.  Efficient  SQL  formulation  of  the  incremental  algorithm  entails  counting  support 
of  multiple-sized  candidates  together  in  the  same  pass. 

The  input  transaction  data  is  stored  in  a  relational  table  T  with  the  schema  (tid,  item). 
The  increment  transaction  table  ST  also  has  the  same  schema.  The  frequent  itemsets  and 
negative  border  of  size  k  are  stored  in  tables  with  the  schema  (itemi, ...  ,itemk,  count). 
We  discuss  the  extensions  to  Subquery  and  Vertical;  two  representative  approaches  from 
the  SQL-92  and  SQL-OR  categories.  We  also  outline  the  SQL-based  candidate  closure 
computation  to  compute  negative  border  closure. 
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Subquery  Approach 

In  this  approach,  support  counting  is  done  by  a  set  of  k  nested  subqueries  where  k  is 
the  size  of  the  largest  itemset.  We  present  here  the  extensions  to  the  subquery  approach  in 
Section  3.5.4  to  count  candidates  of  different  sizes.  Subquery  Qi  finds  all  tids  that  support 
the  distinct  candidate  /-itemsets  (di).  d/  comprises  of  the  distinct  /-item  prefixes  of  all 
candidate  itemsets.  However,  it  is  sufficient  to  use  C/,  the  candidate  /-itemsets  instead  of 
di  since  all  /-item  prefixes  of  candidate  itemsets  with  more  than  /  items  will  be  present 
in  C/.  The  output  of  Qi  is  grouped  on  the  /  items  to  find  the  support  of  the  candidate 
/-itemsets.  Q\  is  also  joined  with  6T  (T  while  counting  the  support  in  the  whole  database) 
and  Ci+\  to  get  Qi+\.  The  SQL  queries  and  the  corresponding  tree  diagrams  for  the  above 
computations  are  given  in  Figure  7.5.  SB[  stores  the  support  counts  of  all  frequent  and 
negative  border  itemsets  in  5T. 

The  output  of  subquery  Qi  needs  to  be  materialized  since  it  is  used  for  counting  the 
support  of  /-itemsets  and  for  generating  Qi+\.  If  the  query  processor  is  augmented  to 
support  multiple  streams  where  the  output  of  an  operator  can  be  piped  into  more  than  one 
subsequent  operators,  the  materialization  of  Qj's  can  be  avoided.  In  the  basic  association 
rule  mining,  we  do  not  have  to  count  itemsets  of  different  sizes  in  the  same  pass  since  Cj+i 
becomes  available  only  after  the  frequent  /-itemsets  are  computed. 

Tables  Bi  and  SB[  store  the  frequent  and  negative  border  /-itemsets  and  their  support 
count  in  T  and  ST  respectively.  Support  counts  of  itemsets  in  the  whole  database  can  be 
computed  by  joining  Bt  and  SBi  and  adding  the  corresponding  support  counts.  We  add 
another  attribute  to  Bt  to  keep  track  of  promoted  borders  (itemsets  that  moved  from  the 
negative  border  to  the  frequent  set). 
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insert  into  5Bt  select  itemi, . .  .,iterri[,  count(*) 
from  (Subquery  Qt)  t 

group  by    itemi,  ...,  itemi 


Subquery  Qt  (for  any  I  between  1  and  k): 
select  itemi,  ■  ■  ■  itemi,  tid 
from  ST  ti,  (Subquery  Qi-\)  as  rj_i,  Q 
where  n_i.iiemi  =  Q-itemi  and  . . .  and 

r/_i.iiem;_i  =  C;.iiem;_iand 

ri-i.tid  =  ti.tid  and 

ti.item  —  Ci.itemi 


Subquery  Q0  :  No  subquery  Q0. 


Group  by  for  5  B_l  Subquery  QJ+1 


Subquery  Ql 
itemi,. ...iteml,  tid  A 


r_l-l.tid  =  tl.tid 
ti.item    =  Cl.iteml 

r_l-l.  itemi      =  Cl.iteml 
rJ-l.item_l-l=Cl.item_l-l    |^><1 
U-l 
Subquery  QJ-1 


5T  tl 


C  I 


Figure  7.5.  Support  counting  using  subqueries 
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Computing  Candidate  Closure 

In  the  apriori  algorithm  candidate  itemsets  are  generated  in  two  steps:  the  join  step 
and  the  prune  step.  In  the  join  step,  two  sets  of  (k  -  l)-itemsets  called  generators  and 
extenders  are  joined  to  get  A;-itemsets.  An  itemset  si  in  generators  joins  with  s2  in  extenders 
if  the  first  {k  —  2)  items  of  si  and  52  are  the  same  and  the  last  item  of  si  is  lexicographically 
smaller  than  the  last  item  of  s2.  The  join  results  in  an  itemset  that  is  s1  extended  with  the 
last  item  of  s2-  The  result  of  the  join  step  is  subjected  to  subset  pruning  which  filters  out 
all  itemsets  with  a  non  frequent  (k  -  l)-subset.  This  can  be  accomplished  by  subsequent 
joins  with  (A;  -  2)  copies  of  a  set  of  itemsets  termed  filters.  While  generating  Ck  from  Ffc_: 
in  the  basic  apriori  algorithm,  Ffc_!  acts  as  generators,  extenders  and  filters. 

In  the  incremental  algorithm,  we  compute  the  candidate  closure  to  avoid  multiple 
passes  while  counting  the  support  of  the  new  candidates.  It  can  be  seen  that  all  the 
new  candidates  will  be  supersets  of  promoted  borders.  Therefore,  it  is  sufficient  to  use  the 
itemsets  that  are  promoted  borders  as  generators.  In  order  to  generate  Ck,  the  candidates  of 
size  k,  we  use  PBk_x  UCk-i  as  generators  and  PBk_x  \JCk-\  UFk_x  as  extenders  and  filters. 
PBk-i  and  Fk-\  denote  promoted  borders  and  frequent  itemsets  of  size  (&  — 1)  respectively. 
The  candidate  generation  process  starts  with  C0  as  the  empty  set  and  terminates  when  Ck 
becomes  empty.  It  is  straight-forward  to  derive  SQL  queries  for  this  process  and  we  do  not 
present  them  here  (refer  Section  3.4.1). 

Vertical 

In  the  Subquery  approach,  for  every  transaction  that  supports  an  itemset  we  generate 
(itemset,  tid)  tuples  resulting  in  large  intermediate  tables.  The  Vertical  approach  avoids 
this  by  collecting  all  tids  that  support  an  itemset  into  a  BLOB  (binary  large  object)  and 
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generates  (itemset,  tid-list)  tuples.  Initially,  tid-lists  for  individual  items  are  created  using  a 
table  function.  The  tid-list  for  an  itemset  is  obtained  by  intersecting  the  tid-lists  of  its  items 
using  a  user-defined  function  (UDF).  The  SQL  queries  for  support  counting  are  similar  in 
structure  to  that  of  the  Subquery  approach  except  for  the  use  of  UDFs  to  intersect  the 
tid-lists.  We  refer  the  reader  to  Section  4.2  for  the  details. 

The  increment  transaction  table  ST  is  transformed  into  the  vertical  format  by  creating 
the  delta  tid-lists  of  the  items.  The  delta  tid-lists  are  used  to  count  the  support  of  the 
candidate  itemsets  in  ST  which  are  later  merged  with  the  original  tid-lists.  This  can  be 
accomplished  by  joining  the  original  tid-list  table  with  the  delta  tid-list  table  and  merging 
the  tid-lists  with  a  UDF.  If  the  incremental  algorithm  requires  a  pass  over  the  complete 
data,  the  merged  tid-lists  are  used  for  support  counting. 

7.4.2     Performance  Results 

In  this  section,  we  report  the  results  of  some  of  our  performance  experiments  to  quan- 
tify the  advantages  of  the  incremental  mining  algorithm.  Note  that  incremental  mining  can 
be  considered  as  a  special  case  of  relaxing  the  frequency  constraint.  These  experiments  were 
performed  on  Version  5  of  IBM  DB2  Universal  Server  installed  on  a  Sun  Ultra  5  Model  270 
with  a  269  MHz  UltraSPARC-Hi  CPU,  128  MB  main  memory  and  a  4  GB  disk.  We  report 
the  results  of  the  SQL  formulations  of  the  incremental  algorithm  based  on  the  Subquery 
and  Vertical  approaches. 

The  experiments  were  performed  on  synthetic  data  generated  using  the  technique  in 
Agrawal  and  Srikant  [13].  The  dataset  used  for  the  experiment  was  T10.I4.D100K  (Mean 
size  of  a  transaction  =  10,  Mean  size  of  maximal  potentially  large  itemsets  =  4,  Number  of 
transactions  =  100  thousand).  The  increment  database  is  created  as  follows:  We  generate 
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100  thousand  transactions,  of  which  (100  —  d)  thousand  is  used  for  the  initial  computation 

and  d  thousand  is  used  as  the  increment,  where  d  is  the  fractional  size  (in  percentage)  of 

the  increment. 

|«1%  ■  5%  "oT6%  I 


Figure  7.6.  Speed  up  of  the  incremental  algorithm  based  on  the  Subquery  approach 
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Figure  7.7.  Speed  up  of  the  incremental  algorithm  based  on  the  Vertical  approach 


We  compare  the  execution  time  of  the  incremental  algorithm  with  respect  to  mining 
the  whole  dataset.  Figures  7.6  and  7.7  shows  the  corresponding  speed-ups  of  the  incremen- 
tal algorithm  based  on  the  Subquery  and  the  Vertical  approaches  for  different  minimum 
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support  thresholds.   We  report  the  results  for  increment  sizes  of  1%,5%  and  10%  (shown 
in  the  legend).  We  can  make  the  following  observations  from  the  graphs. 

•  The  incremental  algorithm  based  on  the  Subquery  approach  achieves  a  speed-up  of 
about  3  to  20  as  compared  to  mining  the  whole  dataset.  However,  the  maximum 
speed-up  of  the  Vertical  approach  is  only  about  4.  For  support  counting  the  Vertical 
approach  uses  a  user-defined  function  (UDF)  to  intersect  the  tid-lists.  The  incremen- 
tal algorithm  should  also  invoke  the  UDF  at  least  the  same  number  of  times  since  the 
support  of  all  the  itemsets  in  the  frequent  set  and  the  negative  border  needs  to  be 
found  in  the  increment  database.  In  cases  where  the  support  of  new  candidates  needs 
to  be  counted  the  number  of  invocations  will  be  even  more  as  compared  to  mining 
the  whole  dataset.  The  time  taken  by  the  Vertical  approach  in  the  support  counting 
phase  is  directly  proportional  to  the  number  of  times  the  UDF  is  called.  However,  the 
incremental  algorithm  saves  in  the  tid-list  creation  phase  since  the  size  of  the  incre- 
ment dataset  is  only  a  fraction  of  the  whole  dataset.  This  explains  why  the  speed-up 
of  the  Vertical  approach  is  low.  In  contrast  the  Subquery  approach  achieves  higher 
speed-up  since  the  time  taken  is  proportional  to  the  size  of  the  dataset. 

It  is  possible  to  achieve  better  speed-up  for  the  Vertical  approach  by  allocating  a 
smaller  BLOB  (binary  large  object)  for  computations  involving  the  increment  dataset. 
Note  that  the  tid-lists  for  the  items  are  stored  as  BLOBs.  In  our  experiments,  we  used 
the  same  BLOB  size  for  the  increment  dataset  and  the  initial  dataset  in  order  to  use 
the  same  user-defined  function  for  support  counting  and  the  same  table  function  for 
tid-list  creation  (refer  Section  4.2  for  a  detailed  description  of  the  Vertical  approach). 
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•  The  speed-up  reduces  as  the  minimum  support  threshold  is  lowered.  At  lower  support 
values  the  chances  of  the  negative  border  expanding  is  higher  and  as  a  result  the 
incremental  algorithm  may  have  to  compute  the  candidate  closure  and  count  the 
support  of  the  new  candidates  in  the  whole  dataset. 

•  The  speed-up  is  higher  for  smaller  increment  sizes  since  the  incremental  algorithm 
needs  to  process  less  data. 

•  With  respect  to  the  absolute  execution  time,  the  Subquery  and  the  Vertical  ap- 
proaches followed  the  same  trend  as  explained  in  Chapter  4.  The  Vertical  approach 
was  about  3  to  6  times  faster  than  the  Subquery  approach. 

7.4.3     New-Candidate  Optimization 

In  the  basic  incremental  algorithm,  we  find  the  frequent  itemsets  in  the  increment 
database  db  along  with  counting  the  support  of  all  the  itemsets  in  the  frequent  set  and 
the  negative  border.  However,  the  frequent  itemsets  in  db  is  used  only  to  prune  the  non 
frequent  itemsets  in  db  while  computing  the  candidate  closure.  In  the  candidate  closure 
computation  we  assume  that  the  new  candidate  A;-itemsets  are  frequent  while  generating 
the  (k  +  l)-itemsets.  At  this  step  the  new  candidate  A;-itemsets  that  are  infrequent  in  db 
are  known  to  be  infrequent  in  the  whole  dataset  as  well  and  can  be  pruned.  This  is  because 
the  new  candidate  A;-itemsets  were  infrequent  in  the  old  dataset  (they  were  not  even  in  the 
negative  border).  Therefore,  they  need  to  be  frequent  at  least  in  db  to  have  a  chance  of 
being  frequent  in  the  whole  dataset. 

With  the  new-candidate  optimization,  we  count  the  support  of  an  itemset  in  db  only 
if  it  is  required.  In  the  first  phase,  while  counting  the  support  in  db  of  the  itemsets  in  the 
frequent  set  and  the  negative  border,  we  do  not  find  all  the  frequent  itemsets  in  db.  During 
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the  candidate  closure  computation,  we  count  the  support  in  db  of  just  the  new-candidates 
for  the  pruning  explained  above.  This  results  in  better  speed-up  as  compared  to  the  basic 
incremental  algorithm. 


Minimum   Support   (in   *?*») 


Figure  7.8.   Speed  up  of  the  incremental  algorithm  based  on  the  Subquery  approach  with 
the  new-candidate  optimization 
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Figure  7.9.  Speed  up  of  the  incremental  algorithm  based  on  the  Vertical  approach  with  the 
new-candidate  optimization 


Figures  7.8  and  7.9  show  the  speed-up  of  the  Subquery  and  Vertical  approaches  with 
the  new-candidate  optimization.  We  can  observe  that  this  optimization  achieves  speed- 
ups  that  are  up  to  25%  better.     The  improvement  is  more  at  smaller  increment  sizes. 
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The  reason  is  that,  when  the  increment  is  smaller  we  have  to  use  proportionately  smaller 
minimum  support  values  while  finding  the  frequent  itemsets  in  db.  This  could  result  in 
counting  too  many  spurious  candidates. 

7.4.4     Other  Approaches 

In  the  Loose-coupling  approach,  the  transaction  data  is  read  tuple  by  tuple  from  the 
DBMS  to  the  mining  kernel  using  a  cursor  interface.  This  architecture  can  be  extended 
to  handle  incremental  mining  just  by  implementing  the  incremental  algorithm  outlined  in 
Section  7.1  in  the  mining  kernel.  The  DBMS  interface  does  not  require  any  change.  In 
cases  where  the  support  of  new  itemsets  need  to  be  counted,  limiting  the  data  access  to 
just  one  scan  of  the  whole  database  entails  counting  candidate  itemsets  of  multiple  sizes 
in  the  same  pass.  This  can  be  accomplished  by  passing  the  transactions  through  all  the 
candidates  of  different  sizes   and  updating  their  support  counts. 

The  Stored-procedure  approach  where  the  mining  algorithm  is  encapsulated  as  a 
stored  procedure  that  runs  in  the  same  address  space  as  the  DBMS  and  the  Cache-Mine 
approach  where  the  data  is  cached  outside  the  DBMS,  can  also  be  extended  for  incremen- 
tal mining.  However,  the  Cache-Mine  approach  might  not  give  better  performance  than 
the  others  since  the  incremental  algorithm  requires  at  most  one  scan  of  the  entire  data. 
Extending  the  UDF  approach  for  incremental  mining  is  straight-forward  and  will  involve 
writing  UDFs  for  the  different  steps  of  the  incremental  algorithm. 

7.5     Constrained  Associations 

In  this  section,  we  introduce  associations  with  different  kinds  of  constraints  on  the 
itemsets  or  constraints  that  characterize  the  dataset  from  which  the  associations  are  de- 
rived.   Let  1  =  {ii,«2i im}  be  a  set  of  literals,  called  items  that  are  attribute  values 
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of  a  set  of  relational  tables.  A  constrained  association  is  defined  as  a  set  of  itemsets 
{X|X  C  ISzC(X)},  where  C  denotes  one  or  more  boolean  constraints.  Note  that  we  do 
not  concentrate  on  generating  the  association  rules  in  the  traditional  sense  [7].  However, 
the  associations  between  attribute  values  that  we  generate  can  be  used  for  rule  generation. 

7.5.1     Categories  of  Constraints 

We  divide  the  constraints  into  four  different  categories  that  are  outlined  below2  We 
illustrate  each  of  them  with  sample  mining  computations.  The  data  model  used  in  our 
examples  is  that  of  a  point-of-sale  (POS)  model  for  a  retail  chain.  When  a  customer  buys 
a  product  or  series  of  products  at  a  register,  that  information  is  stored  in  a  transactional 
system,  which  is  likely  to  hold  other  information  such  as  who  made  the  purchase  and  what 
types  of  promotions  were  involved.  The  data  is  stored  in  three  relational  tables  sales, 
productssold  and  product  with  respective  schemas  as  shown  in  Figure  7.10. 


SALES 


PRODUCTS  SOLD 


PRODUCT 


Transaction  id 


Customer_id 
Total  price 
No.  of  products 


Transaction_id 
Product_id 


Price 
Promotion  id 


Product  id 


Type 

Name 

Description 


Figure  7.10.  Point  of  sales  data  model 


Frequency  Constraint 

This  is  the  same  as  the  minimum  support  threshold  in  the  "support-confidence"  frame- 
work for  association  rule  mining  [13].  An  itemset  X  is  said  to  be  frequent  if  it  appears  in 
at  least  s  transactions,  where  s  is  the  minimum  support.  In  the  point-of-sales  data  model  a 


2We  refer  the  reader  to  Srikant  et  al.  [121]  and  Ng  et  al.  [97]  for  nice  discussions  of  various  kinds  of 
constraints.  Here  we  categorize  them  based  on  their  usage  in  the  mining  process  which  is  explained  in 
Section  7.5.2  with  an  example 
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transaction  correspond  to  a  customer  transaction.  Note  that  in  other  domains,  the  notion 
of  a  transaction  and  an  itemset  appearing  in  a  transaction  could  be  different.  Frequent 
itemsets,  the  ones  which  satisfy  the  frequency  constraint  are  defined  as  {X\f(X)  >  s} 
where  f(X)  is  the  frequency  of  X. 

Most  of  the  algorithms  for  frequent  itemset  discovery  utilizes  the  downward  closure 
property  of  itemsets  with  respect  to  the  frequency  constraint;  that  is,  if  an  itemset  is 
frequent,  then  so  are  all  its  subsets.  Downward  closure  is  a  pruning  property.  Level-wise 
algorithms  [7]  find  all  itemsets  with  a  given  property  among  itemsets  of  size  k  (fc-itemsets) 
and  use  this  knowledge  to  explore  (k  +  l)-itemsets.  They  start  with  the  assumption  that 
all  (k  +  l)-itemsets  are  potentially  frequent  (frequency  is  just  an  example  for  a  downward 
closed  property).  As  the  A;-itemsets  are  examined,  they  prune  out  some  (k  +  l)-itemsets 
that  cannot  be  frequent.  In  effect,  for  pruning,  they  use  the  contrapositive  of  the  frequent 
itemset  definition  "if  any  subset  of  a  (k  +  l)-itemset  is  not  frequent,  then  neither  can  the 
(k  +  l)-itemset".  After  the  pruning,  they  go  through  the  remaining  list,  checking  each 
(k  +  l)-itemset  for  its  frequency.  The  downward  closure  property  is  similar  to  the  anti- 
monotonicity  property  defined  in  Ng  et  al.  [97]. 

In  the  context  of  the  point-of-sale  data  model,  the  frequency  constraint  can  be  used 
to  discover  products  bought  together  frequently. 

Item  Constraint 

In  order  to  discover  goal-oriented  associations,  it  is  often  required  to  impose  constraints 
on  the  itemsets.  For  instance,  find  only  the  itemsets  containing  at  least  one  item  from  a 
user-defined  subset  of  items.  Let  B  be  a  boolean  expression  over  the  set  of  items  1.  The 
problem  is  to  find  itemsets  that  satisfy  the  constraint  B.  Typically  the  item  constraint  will 
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be  associated  with  a  frequency  constraint  also,  where  we  want  to  find  frequent  itemsets 
that  satisfy  B.  Three  different  algorithms  for  mining  associations  with  item  constraints  are 
presented  in  Srikant  et  al.  [121]. 

The  item  constraints  enables  us  to  pose  mining  queries  such  as  "What  are  the  products 
whose  sales  are  affected  by  the  sale  of,  say,  barbecue  sauce?"  and  "What  products  are 
bought  together  with  sodas  and  snacks?" .  The  1-variable  constraints  in  Ng  et  al.  [97]  are 
a  special  case  of  item  constraints. 

Aggregation  Constraint 

These  are  constraints  involving  aggregate  functions  on  the  items  that  form  the  item- 
set.  For  instance,  in  the  POS  example  an  aggregation  constraint  could  be  of  the  form 
min(productssold.price)  >  p.  Here  we  consider  a  product  as  an  item.  The  aggregate  func- 
tion could  be  min,  max,  sum,  count,  avg  or  any  other  user-defined  aggregate  function.  An 
aggregation  constraint  of  the  form  min(productssold.price)  >  p  can  be  used  to  find  "ex- 
pensive products  that  are  bought  together".  Similarly  max(productssold.price)  <  q  can 
be  used  to  find  "inexpensive  products  that  are  bought  together" .  These  aggregate  functions 
can  be  combined  in  various  ways  to  express  a  whole  range  of  useful  mining  computations. 
For  example,  the  constraint  (min(productssold.price)  <  p)  &:  (avg(productssold.price)  > 
q)  targets  the  mining  process  to  inexpensive  products  that  are  bought  together  with  the 
expensive  ones. 

External  Constraint 

External  constraints  filter  the  data  used  in  the  mining  process.  These  are  constraints 
on  attributes  which  do  not  appear  in  the  final  result  (which  we  call  external  attributes). 
For  example,  if  we  want  to  find  "products  bought  during  big  purchases  where  the  total 
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sale  price  of  the  transaction  is  larger  than  some  amount"  we  can  impose  a  constraint  of  the 
form  sales.totalpri.ee  >  P.  These  constraints  are  useful  to  target  the  mining  process  to 
just  the  relevant  data  there  by  speeding  up  the  process. 

7.5.2     Constrained  Association  Mining 
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Figure  7.11.  Framework  for  constrained  association  mining 

Figure  7.11  shows  the  general  framework  for  the  kth  level  in  a  level- wise  approach  for 
mining  associations  with  constraints.  Note  that  we  could  also  use  SQL-based  approaches  for 
implementing  the  various  operations  in  the  different  phases.  The  different  constraints  are 
applied  at  different  steps  in  the  mining  process.  For  example,  the  item  constraints  and  the 
aggregation  constraints  that  satisfy  the  closure  property  can  be  used  in  the  candidate  gen- 
eration phase  for  pruning  unwanted  candidates.  Three  different  algorithms  for  generating 
candidate  itemsets  in  the  presence  of  item  constraints  are  presented  in  Srikant  et  al.  [121] 
and  we  could  use  them  here.  The  aggregation  constraints  can  be  applied  on  the  result  of 
the  candidate  generation  process  with  item  constraints.  The  external  constraints  and  some 
of  the  aggregation  and  item  constraints  can  be  applied  at  the  data  filtering  stage  to  reduce 
the  size  of  the  input  data  to  the  support  counting  phase  (see  Figure  7.12  for  a  specific 
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example).  The  frequency  constraint  is  applied  during  support  counting  to  filter  out  the 
non-frequent  itemsets.  Finally,  any  unprocessed  constraints  are  checked  as  a  post-counting 
operation.  The  frequent  itemsets  before  the  post-counting  constraint  check  are  used  for  the 
next  level  candidate  generation  since  these  constraints  do  not  possess  the  closure  property. 
In  the  basic  frequent  itemset  mining,  only  the  frequency  constraint  is  present  and  the 
candidate  generation  step  uses  only  the  subset  pruning  strategy  [13]. 

Example:  We  illustrate  the  application  of  the  various  constraints  here  with  a  spe- 
cific example  using  the  point-of-sales  data  model  in  Section  7.5.  The  example  shown  in 
Figure  7.12  finds  product  combinations  containing  "barbecue  sauce"  where  all  the  prod- 
ucts cost  less  than  $50  and  the  average  price  is  more  than  $25  (some  notion  of  similar  priced 
products).  The  combinations  should  appear  in  at  least  100  sales  transactions  with  the  total 
price  of  the  transaction  greater  than  $500.  This  gives  an  idea  of  what  other  moderately 
priced  products  people  buy  with  barbecue  sauce  in  big  purchases  (perhaps  for  parties).  It 
could  help  the  shop  owner  to  decide  on  various  promotions. 
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Figure  7.12.  Point-of-sales  example  for  constrained  association  mining 
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In  the  above  example,  the  constraint  on  the  total  price  of  the  transaction  is  an  external 
constraint  and  is  applied  in  the  data  filtering  stage.  Since  the  maximum  price  of  a  product 
in  the  desired  combination  is  $50,  we  can  also  filter  out  records  that  does  not  satisfy  this 
condition  in  the  data  filtering  stage.  max(Price)  <  50  is  an  aggregation  constraint  which 
satisfies  the  closure  property  and  can  be  applied  in  the  candidate  generation  phase.  The 
constraint  to  include  barbecue  sauce  in  the  combination  and  the  constraint  on  the  average 
price  are  checked  in  the  post-counting  phase  since  they  do  not  satisfy  the  closure  property. 

7.5.3     Incremental  Constrained  Association  Mining 

The  negative  border  based  incremental  mining  algorithm  is  applicable  for  mining  as- 
sociations with  constraints  that  are  closed  with  respect  to  the  set  inclusion  property,  that 
is,  if  an  itemset  satisfies  the  constraint  then  so  do  all  its  subsets.  For  incremental  mining 
of  constrained  associations  also  we  need  to  materialize  and  store  all  the  itemsets  in  the 
negative  border  and  their  support  counts.  The  reason  why  this  is  enough  is  that  when  new 
transaction  data  is  added,  only  the  support  count  of  the  itemsets  could  change.  We  assume 
that  there  is  a  frequency  constraint,  which  is  typically  the  case  in  association  mining.  The 
definition  of  the  negative  border  also  remains  the  same,  which  is  the  set  of  minimal  itemsets 
that  did  not  satisfy  the  frequency  constraint.  We  list  the  various  steps  of  the  incremental 
algorithm  which  is  very  similar  to  the  case  with  just  the  frequency  constraint  except  for  a 
few  differences. 

1.  Find  the  frequent  itemsets  in  the  increment  database  db.  Simultaneously  count  the 
support  of  all  itemsets  X  £  FrequentSets  U  NegativeBorder  in  db.  For  this  we  use 
the  framework  outlined  in  Section  7.5.2.  The  different  constraints  that  are  present 
can  be  handled  at  the  various  steps  in  the  mining  process  as  shown  in  Figure  7.11. 
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2.  Update  the  support  count  of  all  itemsets  in  FrequentSets  U  NegativeBorder  to  in- 
clude their  support  in  db. 

3.  Find  the  itemsets  in  NegativeBorder  that  became  frequent  by  the  addition  of  db  (call 
them  Promoted-NegativeBorder,  PNb). 

4.  Find  the  candidate  closure  with  PNb  as  the  seed.  During  this  computation,  we  use 
the  candidate  generation  procedure  with  constraints  as  described  in  Section  7.5.2. 

5.  If  there  are  no  new  itemsets  in  the  candidate  closure,  skip  this  step.  Otherwise,  count 
the  support  of  all  new  itemsets  in  the  candidate  closure  against  the  whole  dataset. 
The  dataset  is  subjected  to  the  data  filtering  step  before  the  support  counting  as  shown 
in  Figure  7.11. 

7.5.4     Constraint  Relaxation 

The  negative  border  idea  can  be  used  to  handle  some  cases  of  constraint  relaxation 
also,  especially  relaxations  to  the  external  constraints  and  the  frequency  constraint.  For 
instance,  in  the  example  in  Section  7.5.2,  if  we  relax  the  constraint  on  the  sales  transaction 
to  sales. .Total. pr ice  >  300,  it  can  be  cast  as  an  incremental  mining  problem.  In  this  case, 
the  increment  dataset  would  be  the  transactions  with  Total.price  between  300  and  500. 
Note  that,  for  this  to  work  the  relaxed  constraint  should  subsume  the  initial  constraint. 
Relaxations  to  the  frequency  constraint  involves  two  cases.  In  cases  where  the  frequency 
threshold  is  increased,  it  is  straight-forward  to  find  the  new  frequent  itemsets  by  just  filtering 
the  itemsets  with  the  new  threshold.  On  the  other  hand,  if  the  frequency  threshold  is 
lowered  we  need  to  use  an  approach  similar  to  the  incremental  mining  algorithm  where  we 
compute  the  promoted  negative  borders  and  candidate  closure  to  determine  if  the  support 
of  any  new  itemset  needs  to  be  found. 
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7.6     Applicability  Beyond  Association  Mining 

The  incremental  algorithm  we  have  proposed  makes  use  of  the  closure  property  of 
frequent  itemsets.  In  this  section,  we  discuss  how  the  incremental  approach  can  be  gen- 
eralized to  other  data  mining  and  decision  support  problems.  In  Section  7.6.1,  we  discuss 
the  applicability  of  the  incremental  algorithm  to  mine  closed  sets.  Generalizations  to  in- 
cremental evaluation  of  query  flocks  and  certain  kinds  of  materialized  views  are  described 
in  Sections  7.6.2  and  7.6.3  respectively. 

7.6.1     Mining  Closed  Sets 

All  the  efficient  algorithms  for  mining  association  rules  exploits  the  closure  property 
of  frequent  itemsets.  Minimum  support  which  characterizes  frequent  itemsets  is  downward 
closed:  if  an  itemset  has  minimum  support  then  all  its  subsets  also  have  minimum  support. 
The  idea  of  negative  border  can  be  used  for  all  incremental  mining  problems  that  possess 
closure  properties.  If  the  closure  property  is  incrementally  updatable  (for  instance  support) 
also,  it  is  possible  to  limit  the  database  access  to  at  most  one  scan  of  the  whole  database 
as  shown  in  the  incremental  frequent  itemset  mining  example.  A  property  is  incrementally 
updatable  if  it  is  possible  to  derive  the  value  of  it  from  the  corresponding  values  of  different 
partitions  of  the  input  data.  A  few  examples  are  COUNT,  SUM,  MIN,  MAX  and  so  on.  We 
list  below  a  few  other  data  mining  problems  that  have  closure  properties. 

1.  Sequential  patterns:  In  sequential  pattern  mining,  the  frequent  patterns  are  closed 
with  respect  to  minimum  support,  much  like  the  frequent  itemsets.  The  support  is 
incrementally  updatable. 

2.  Correlation  rules:  Correlation  rules  [25]  are  upward  closed  with  respect  to  correlation; 
that  is,  if  a  set  of  items  S  is  correlated,  so  is  every  superset  of  S.   A  set  of  items  is 
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said  to  be  correlated  if  any  of  its  subsets  are  dependent.  For  efficient  evaluation 
of  correlation  rules,  a  notion  of  support  is  introduced  in  Brin  et  al.  [25],  which  is 
downward  closed.  However,  these  properties  are  not  incrementally  updatable. 

3.  Consequent  part  of  association  rules:  For  any  frequent  itemset,  if  a  rule  with  conse- 
quent c  holds  (that  is,  it  has  minimum  confidence)  then  so  do  rules  with  consequents 
that  are  subsets  of  c  [9];  that  is,  such  rules  are  downward  closed.  Note  that  the 
confidence  is  not  incrementally  updatable. 

4.  Maximal  frequent  itemsets:  Random  walk  algorithms  [58]  which  works  by  generating 
a  series  of  random  walks  along  the  itemset  lattice  exploits  the  downward  closure 
property  of  maximal  frequent  itemsets.  Support  which  characterizes  frequent  itemsets 
in  this  case  is  also  incrementally  updatable. 

7.6.2     Query  Flocks 

The  boolean  association  rules  was  first  defined  in  the  context  of  market  basket  data 
and  it  has  been  generalized  to  mine  associations  across  relational  tables  expressed  as  query 
flocks  [128].  A  query  flock  is  a  parameterized  query  with  a  filter  condition  to  eliminate  the 
values  of  parameters  that  are  "uninteresting" .  For  example,  let  us  assume  that  we  want 
to  evaluate  the  mining  query  "What  are  the  interesting  drivers  that  caused  customers  to 
buy  the  widgets  from  a  catalog  ?" .  A  driver  is  deemed  "interesting"  if  it  has  caused  at 
least  10  customers  to  buy  the  widget.  Let  the  data  be  stored  in  a  set  of  relational  tables, 
namely,  catalog(widget,  manufacturer),  sale(customer,  widget),  driver(customer,  widget, 
driver).  The  above  query  can  be  written  as  a  query  flock  in  Datalog  as  shown  below.  The 
filter  condition  prunes  out  values  which  do  not  have  minimum  support.   In  Section  7.6.2, 
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we  discuss  how  the  negative  border  idea  can  be  used  for  efficient  incremental  evaluation  of 
query  flocks. 

QUERY: 

answer(C)  :- 

sale(C,  $W)  AND 
driver (C,  $W,  $D)  AND 
catalog ($W,  manufacturer) 
FILTER: 

COUNT (C)  >  10 

Incremental  Evaluation  of  Query  Flocks 

Applying  the  apriori  technique  for  evaluating  the  above  query  flock  will  result  in  the 
following  safe  subqueries  [128]. 

1.  Q\\  answer(C)    :-  sale(C,   $W) 

2.  Q2:  answer(C)    :-  driver(C,    $W,    $D) 

3.  Q3:  answer(C)    :-  sale(C,   $W)   AND  driver(C,   $W,   $D) 

4.  Q4:  answer (C)    :-  sale(C,   $W)   AND  driver (C,   $W,   $D)   AND 
catalog ($W,  manufacturer) 

The  query  flocks  corresponding  to  the  safe  subqueries  form  a  lattice  with  query  containment 
as  the  partial  order  and  the  original  query  flock  as  the  top  element.  That  is,  if  Q,  and  Q2 
are  elements  of  the  lattice,  Qx  <  Q2  ^  the  result  of  Q2  is  contained  in  the  result  of  Qx. 

During  the  execution  of  the  subqueries  of  the  query  flock,  all  records  with  parameter 
values  which  satisfy  the  filter  condition  are  propagated  to  the  next  higher  subquery  in  the 
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lattice  for  further  evaluation.  For  example,  after  Qi  is  evaluated  all  the  records  corre- 
sponding to  widgets  that  are  bought  by  at  least  10  customers  will  be  piped  to  Q3  which  is 
immediately  above  Q\  in  the  subquery  lattice.  The  parameter  value  combinations  in  this 
case  are  analogous  to  itemsets  in  boolean  association  rule  mining. 

The  negative  border  in  this  apriori-based  query  flock  evaluation  is  the  set  of  parameter 
value  combinations  that  does  not  pass  the  filter  condition  test.  These  combinations  if 
materialized  and  stored  along  with  their  support  counts  can  be  effectively  used  to  update 
the  result  of  the  query  flock  when  the  base  tables  over  which  it  is  defined  are  updated. 
We  first  check  if  any  of  the  records  in  the  negative  border  pass  the  filter  condition  as  a 
result  of  changes  to  the  base  tables.  This  can  be  done  starting  at  the  bottom  element 
of  the  lattice  and  proceeding  upwards,  propagating  the  deltas  corresponding  to  the  new 
combinations.  If  none  of  the  records  in  the  negative  border  passes  the  filter  condition,  we 
do  not  have  to  evaluate  the  subqueries  for  the  lattice  elements  above.  The  base  tables 
may  have  to  be  looked  up  (for  example,  to  be  joined  with  the  deltas)  depending  upon  the 
subquery  representing  the  lattice  element.  It  is  also  possible  to  evaluate  the  filter  condition 
for  all  the  lattice  elements'  negative  borders  together.  However,  this  might  involve  more 
computations  than  propagating  the  deltas  through  the  lattice. 

7.6.3     View  Maintenance 

Incremental  mining  can  also  be  seen  as  materialized  view  maintenance.  In  boolean 
association  rules,  the  frequent  itemsets  and  the  negative  border  are  in  fact  aggregate  views 
over  the  transaction  table.  In  query  flocks  each  element  in  the  subquery  lattice  can  be 
considered  as  a  view  defined  on  the  base  tables.  Therefore,  view  maintenance  techniques 
similar  to  the  ones  in  Mumick  et  al.  [94]  can  be  used  for  incremental  mining.  On  the  other 
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hand,  the  negative  border  based  change  propagation  can  also  be  applied  for  the  maintenance 
of  views  involving  monotone  aggregate  functions  that  satisfy  the  apriori  subset  property. 
For  example  if  the  view  definition  has  a  filter  condition  such  as  the  SQL  "having"  clause, 
it  could  be  beneficial  to  also  store  the  records  which  does  not  pass  the  filter  condition  test 
(same  as  the  negative  border).  When  the  base  tables  are  updated,  these  records  will  be 
useful  to  maintain  the  view  rather  than  re-materializing  it. 

An  itemset  can  also  be  treated  as  a  point  in  the  data  cube  [55]  defined  by  the  items  as 
dimensions  and  support  as  the  measure.  The  data  cube  points  can  be  arranged  as  a  lattice 
according  to  the  partial  order  on  the  itemsets.  In  that  case,  the  data  cube  maintenance 
algorithms  similar  to  that  of  Roussopoulos  et  al.  [107]  are  also  applicable  here.  However, 
this  approach  might  not  be  viable  in  cases  consisting  of  large  number  of  items. 

7.7     Summary 

In  this  chapter,  we  developed  an  incremental  algorithm  for  mining  frequent  itemsets. 
We  developed  SQL  formulations  for  the  incremental  approach  based  on  a  few  representative 
approaches  from  Chapters  3  and  4.  We  also  show  how  the  incremental  algorithm  can 
be  generalized  to  handle  certain  kinds  of  constraints  and  constraint  relaxations  and  its 
applicability  to  other  mining  problems. 


CHAPTER  8 
CONCLUSIONS 


We  analyzed  various  architectural  alternatives  for  integrating  mining  with  a  relational 
database  system.  We  studied  various  mining  algorithms  with  the  twin  goals  of  finding 
the  trade-offs  between  the  architectural  options  and  the  extensions  needed  in  a  DBMS  to 
efficiently  support  mining.  First,  we  experimented  with  different  ways  of  implementing  the 
association  rule  mining  algorithm  in  SQL  to  find  if  it  is  at  all  possible  to  get  competitive 
performance  out  of  SQL  implementations. 

We  considered  two  categories  of  SQL  implementations.  We  experimented  with  four 
different  implementations  based  purely  on  SQL-92.  We  picked  the  best  SQL-92  implemen- 
tation for  further  analysis.  We  developed  cost  formulae  for  different  execution  plans  based 
on  the  input  data  parameters  and  the  relational  operator  costs.  This  cost  analysis  provides 
a  basis  for  incorporating  the  semantics  of  mining  algorithms  into  future  query  optimizers. 
We  next  experimented  with  a  collection  of  approaches  that  made  use  of  the  new  object- 
relational  extensions  like  UDFs,  BLOBs,  and  table  functions.  With  this  extended  SQL  we 
got  significant  performance  improvements  over  the  SQL-92  based  implementations. 

We  compared  the  SQL  implementation  with  different  architectural  alternatives.  We 
observed  that  based  just  on  performance  the  Cache-Mine  approach  is  the  winner.  A  close 
second  is  the  SQL-OR  approach  that  was  sometimes  slightly  better  than  Cache-Mine  and 
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was  never  worse  than  a  factor  of  two  on  the  real-life  datasets.  Both  these  approaches  re- 
quire additional  space  for  caching,  however.  The  Stored-procedure  approach  does  not 
require  any  extra  space  (except  possibly  for  initially  sorting  the  data  in  the  DBMS)  and 
can  perhaps  be  made  to  be  within  a  factor  of  two  to  three  of  Cache-Mine  with  the  re- 
cent algorithms  [26,  127].  The  UDF  approach  is  about  a  factor  of  two  faster  than  the 
Stored-procedure  approach  but  is  significantly  harder  to  code.  The  SQL  approach  offers 
some  secondary  advantages  like  easier  development  and  maintenance  and  potential  for  au- 
tomatic parallelization.  However,  it  might  not  be  as  portable  as  the  Cache-Mine  approach 
across  different  database  management  systems. 

Next,  we  wanted  to  find  out  if  it  is  possible  to  handle  other  more  complex  mining  tasks 
within  the  same  SQL  framework.  We  studied  generalized  association  rules  and  sequential 
patterns  with  the  goal  of  showing  that  the  association  rule  framework  can  be  extended  easily 
for  these  mining  problems  as  well.  We  developed  several  SQL  formulations  for  generalized 
association  rule  and  sequential  pattern  mining;  some  of  them  by  extending  the  association 
rule  approaches.  The  major  addition  for  generalized  association  rule  was  to  "extend"  the 
input  transaction  table  (transform  T  to  T*).  For  sequential  patterns,  the  join  predicates 
for  candidate  generation  and  support  counting  were  significantly  different.  We  conducted 
some  experiments  on  real-life  datasets  and  found  that  the  performance  observations  hold 
here  also. 

In  order  to  handle  the  volume  of  data  stored  in  present  day  data  warehouses,  which 
keeps  streaming  in,  it  is  important  to  develop  incremental  mining  algorithms.  We  developed 
an  incremental  association  rule  mining  algorithm  which  does  not  need  to  examine  the  old 
data  if  the  frequent  itemsets  do  not  change.  Even  in  cases  where  new  frequent  itemsets  are 
added,  access  to  the  old  database  can  be  limited  to  just  one  scan. 
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In  the  context  of  goal-oriented  mining,  we  often  have  to  mine  associations  with  various 
kinds  of  constraints.  We  identify  and  categorize  the  different  constraints  and  show  how  they 
can  be  processed  in  the  relational  framework.  The  incremental  approach  can  also  be  used 
to  handle  certain  kinds  of  constraint  relaxation.  We  develop  a  general  framework  for  the 
incremental  approach  to  mine  associations  with  constraints  having  the  downward  closure 
property. 

The  concept  of  negative  border  which  is  the  key  to  the  incremental  algorithm  has 
other  applications  also.  It  can  be  used  for  mining  association  rules  with  varying  support 
and  confidence  values.  For  instance,  the  negative  border  can  be  used  to  determine  the 
updated  frequent  itemsets  if  the  support  is  changed.  If  the  support  is  increased  it  is  trivial 
to  update  the  frequent  itemsets.  However,  if  the  support  is  lowered  the  itemsets  in  the 
promoted  negative  border  can  be  used  to  determine  if  the  support  of  any  new  itemset 
needs  to  be  counted  in  a  similar  manner  to  incremental  mining.  This  could  be  quite  useful 
in  cases  where  determining  the  "correct"  support  value  is  difficult.  Initially  the  frequent 
itemsets  for  an  approximate  support  can  be  computed  which  is  further  refined  based  on 
user  feedback. 

We  further  show  the  applicability  of  the  incremental  algorithm  to  certain  classes  of 
data  mining  and  decision  support  problems. 

In  Section  8.1,  we  identify  certain  extensions  to  the  current  database  management  sys- 
tems that  are  useful  for  mining.  We  enumerate  the  specific  contributions  of  this  dissertation 
in  Section  8.2  and  in  Section  8.3,  we  list  possibilities  for  further  research. 
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8.1     Proposed  Extensions 

One  of  the  goals  of  our  work  has  been  to  "unbundle"  complex  mining  operations  like 
associations  and  classifications  into  smaller  primitives  that  can  be  supported  efficiently  by 
a  general  purpose  DBMS  and  be  useful  for  a  large  class  of  mining  algorithms.  Our  exercise 
with  some  of  the  mining  operations  has  helped  us  identify  a  few  such  primitives  that  we 
believe  could  be  generally  useful  in  a  number  of  other  decision  support  applications. 

8.1.1     Richer  Set  Operations: 

We  expect  a  richer  collection  of  set  operations  to  be  useful  for  mining.  For  the  mining 
problems  addressed  in  this  dissertation,  we  used  three  different  versions  of  the  basic  subset 
operation.  For  candidate  generation  we  needed  to  find  all  k  —  1  subsets  of  a  set  of  k  elements 
for  doing  the  pruning.  We  enumerated  the  subsets  explicitly  in  the  query  predicate  because 
the  size  of  the  set  and  the  subsets  was  fixed  and  known  in  advance.  For  support  counting, 
we  used  the  Comb-K  table  function  which  generated  all  k  subsets  of  a  variable  length  set. 
For  rule  generation,  we  used  the  GenSubset  table  function  to  generate  all  subsets  of  a 
variable  length  set.  The  subset  operation  would  also  be  useful  in  building  decision  trees  on 
categorical  attributes  [85]. 

Another  important  set  operation  was  the  intersect  operation  used  in  the  Vertical 
approach.  Most  systems  already  have  internal  implementations  of  this  operation  for  doing 
AND  and  OR  operations  on  RID  (record  identifier)  lists  obtained  during  multiple  index 
scans  [93].  In  current  OLAP  and  data  warehousing  systems  this  operation  is  rampant  in 
the  popular  bit-mapped  indices.  Our  Vertical  approach  is  curiously  similar  to  this  bit- 
mapped approach  with  one  important  difference.  Instead  of  performing  ANDs  on  RIDs,  we 
perform  the  ANDs  on  another  attribute,  the  transaction  identifier. 
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We  also  used  the  set  difference  operation  for  pruning  itemsets  containing  an  item  and 
its  ancestor,  in  generalized  association  rules. 

8.1.2     Enhanced  Aggregation 

Another  common  operation  that  we  used  is  the  Gather  operation  that  can  transform 
two  attributes  in  a  data  table  to  a  form  where  for  distinct  values  of  one  of  the  attributes 
(called  the  grouping  attribute)  we  collect  together  in  a  set  all  values  of  the  other  attribute. 
We  can  think  of  Gather  as  a  glorified  aggregate  function.  Its  reverse  operation  scatter  is 
a  special  case  of  the  subset  operation  where  the  subset  size  is  1. 

We  used  this  operation  in  four  of  our  SQL  approaches:  Gather  Join,  GatherPrune, 
Vertical  and  Horizontal.  To  implement  the  Gather  table  function  in  our  queries  we 
required  data  to  appear  in  a  particular  order  as  inputs  to  the  table  function.  This  would  be 
trivial  to  express  in  SQL  if  order  by  clauses  were  allowed  in  inner  subqueries.  In  the  absence 
of  such  a  feature,  we  relied  on  our  knowledge  of  DB2's  internals  to  force  such  an  order  in 
our  experiments.  The  Gather  function  can  also  be  treated  as  an  aggregate  function  that 
simply  concatenates  its  arguments.  Systems  that  have  support  for  user-defined  aggregate 
functions  [122]  can  easily  support  such  a  functionality. 

There  are  several  other  decision  support  applications  where  gather  could  be  extremely 
useful.  For  instance,  suppose  we  are  trying  to  classify  customers  of  a  credit  card  company 
and  the  data,  in  addition  to  static  customer  attributes  like  age  and  salary,  consists  of 
detailed  transaction  data  about  each  purchase  activity  of  a  customer.  In  this  case,  we 
would  like  to  gather  all  activities  of  a  customer  as  one  time  series  and  extract  features 
from  these  time  series  for  classification.  In  OLAP  applications  too,  data  often  needs  to  be 
converted  back  and  forth  from  a  format  where  different  measures  are  gathered  together  as 
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different  attributes  of  the  same  row  to  a  format  where  different  measures  are  scattered  as 
different  rows. 

8.1.3  Multiple  Streams 

In  current  relational  DBMSs,  there  is  only  a  single  stream  of  control  along  which  data 
flows.  Support  for  multiple  streams,  where  the  output  of  an  operation  can  be  piped  to 
more  than  one  subsequent  operations  will  be  useful  for  a  large  class  of  mining  and  decision 
support  operations.  In  the  SQL  formulation  of  incremental  mining,  we  show  how  it  will  be 
useful  for  counting  the  support  of  multiple-sized  candidates  in  one  pass.  This  idea  can  be 
used  for  association  rule  mining  also  to  reduce  the  number  of  passes  utilizing  some  of  the 
newer  algorithms  [26,  127]. 

In  classification,  we  have  to  compute  the  class  distribution  corresponding  to  differ- 
ent split  points  and  attribute  value  combinations.  This  will  require  multiple  scans  using 
standard  SQL.  The  Unpivot  operator  proposed  in  Graefe  et  al.  [54]  attempts  to  circumvent 
the  problem  of  multiple  scans.  However,  support  for  multiple  streams  will  make  it  much 
simpler  for  doing  it  in  one  scan  of  the  database.  In  the  OLAP  world  also  this  will  be  useful 
for  computing  the  datacube  [56]  and  simultaneous  multidimensional  aggregates  [1,  133]. 

8.1.4  Sampling 

Sampling  is  a  useful  technique  for  scaling  up  algorithms  that  handle  large  volumes  of 
data.  The  use  of  sampling  for  query  size  estimation,  histogram  construction,  computing 
quantiles  and  so  on  has  been  addressed  in  several  research  papers  [35,  59]. 

Sampling  can  be  used  to  get  quick  and  good  approximate  answers  to  the  mining  and 
decision  support  queries.  The  algorithms  can  be  first  run  on  a  sample  to  give  the  user 
approximate  answers  based  on  which  he/she  can  decide  whether  to  run  it  on  the  whole 
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dataset  and  also  to  fine  tune  input  parameters.  There  are  several  mining  algorithms  which 
use  sampling  [44,  127]. 

8.2     Contributions l 

In  this  dissertation,  we  have  addressed  the  following  problems. 

1.  Analyze  the  different  database  integration  alternatives  for  data  mining. 

2.  Develop  and  implement  various  SQL-based  approaches  for  association  rule  mining. 

3.  Study  the  performance  profile  of  current  DBMSs  to  execute  the  above  SQL  queries. 

4.  Compare  the  different  database  integration  architectures  quantitatively  and  qualita- 
tively. 

5.  Develop  cost  formulae  for  the  SQL  approaches  based  on  input  data  parameters  and 
relational  operator  costs.  These  provide  some  insights  into  enhancing  current  cost- 
based  optimizers  to  incorporate  the  domain-specific  semantics  of  mining  algorithms. 

6.  Extend  the  association  rule  framework  for  mining  generalized  association  rules  and 
sequential  patterns. 

7.  Develop  an  incremental  association  rule  mining  algorithm. 

8.  SQL  formulations  of  the  incremental  algorithm  and  its  generalization  to  handle  con- 
straints. 

9.  Generalize  the  incremental  algorithm  for  constrained  association  mining  and  demon- 
strate its  applicability  to  a  larger  class  of  data  mining  and  decision  support  problems. 


'The  work  corresponding  to  the  first  four  items  was  primarily  done  by  researchers  from  IBM  Almaden 
Research  Center  and  the  author  was  a  contributor.  For  the  remaining  items,  the  author  was  the  primary 
contributor.  The  file-based  incremental  association  rule  mining  algorithm  (item  7)  was  developed  primarily 
by  the  author  as  part  of  the  Introduction  to  Parallel  Computing  course  project. 
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10.  Explore  primitive  operators  to  support  mining  and  decision  support  in  databases. 

8.3     Future  Work 

The  work  presented  in  this  dissertation  points  to  several  directions  for  future  research. 
A  natural  next  step  is  to  experiment  with  other  kinds  of  mining  operations  such  as  classi- 
fication and  clustering  [45],  to  verify  if  our  conclusions  hold  for  these  other  cases  too.  It  is 
also  important  to  derive  a  set  of  primitive  operators  with  which  the  different  mining  and 
decision  support  operations  can  be  composed.  The  operators  we  have  identified  provide 
some  headway  in  that  direction.  Another  useful  direction  is  to  explore  what  kind  of  a  sup- 
port is  needed  for  answering  short,  interactive,  ad  hoc  queries  involving  a  mix  of  mining  and 
relational  operations.  The  success  of  SQL  as  the  most  popular  data  management  language 
can  be  mainly  attributed  to  its  ad  hoc  query  support.  In  the  mining  context,  what  is  an 
ad  hoc  mining  query?  Is  it  just  mining  computations  expressed  with  some  constraints  on 
the  result?  How  much  support  can  we  leverage  from  existing  relational  engines  for  min- 
ing? What  data  model  and  language  extensions  are  needed?  Does  the  execution  of  these 
operators  involve  selective  materialization  and  incremental  evaluation  techniques?  Some  of 
these  questions  are  orthogonal  to  whether  the  bulky  mining  operations  are  implemented 
using  SQL  or  not.  Nevertheless,  these  are  important  in  providing  the  analysts  with  a  well 
integrated  platform  where  mining  and  relational  operations  can  be  inter-mixed  in  useful 
and  flexible  ways. 

8.4     Closing 

In  the  coming  years,  database/data  warehouse  systems  will  be  called  upon  to  deliver 
return-on-investment  in  the  form  of  nuggets  of  knowledge  or  better  insights  about  the  huge 
volumes  of  data  they  store  and  manage.  In  order  to  efficiently  support  the  multiplicity  of 
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data  mining  and  decision  support  operations  on  the  same  data,  tighter  integration  of  these 
operations  with  database  systems  will  be  required. 

This  dissertation  presents  strategies  aimed  at  tighter  database  integration  of  mining 
and  identifies  optimizations  and  primitive  operators  to  make  database  systems  a  better 
platform  for  mining  and  decision  support. 
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